Efficient and Consistent Transaction Processing in Wireless Data Broadcast Environments

Dissertation

zur Erlangung des akademischen Grades des Doktors der Naturwissenschaften an der Universität Konstanz, Fachbereich Informatik und Informationswissenschaft

vorgelegt von

André Seifert

Begutachtet von

1. Referent: Prof. Dr. Marc H. Scholl, Universität Konstanz
2. Referent: Prof. Dr. Daniel A. Keim, Universität Konstanz

Tag der Einreichung: 05.01.2005
Tag der mündlichen Prüfung: 27.04.2005

“Mit Worten verhält es sich wie mit Sonnenstrahlen — je mehr man sie kondensiert, um so tiefer dringen diese.”

– Robert Southey

Zusammenfassung

Die hybride, d.h. push- und pull-basierte Datenkommunikationsmethode wird sich wahrscheinlich als primärer Ansatz für die Verteilung von Massendaten an große Benutzergruppen in mobilen Umgebungen durchsetzen. Eine wesentliche Aufgabenstellung innerhalb hybrider Datenkommunikationsnetze ist es, Klienten einen konsistenten und aktuellen Blick auf die vom Server entweder über einen Breitband-Broadcastkanal oder mehrere dedizierte Schmalband-Unicastkanäle bereitgestellten Daten zu geben. Ein ebenso wichtiges Forschungsgebiet innerhalb hybrider Datenkommunikationssysteme stellt die Pufferverwaltung der mobilen Endgeräte dar, welche die Diskrepanz zwischen der Struktur und dem Inhalt des Broadcast-Programmes und den klienten-spezifischen Informationsbedürfnissen und Datenzugriffsmustern aufzulösen versucht. Weiterhin kommt dem Klientenpuffer die Aufgabe zu, die sequentielle Zugriffscharakteristik des Broadcastkanals weitgehend zu verbergen; er kann darüber hinaus als Speicherort verwendet werden, um veraltete — aber für den Klienten immer noch nützliche — Datenobjekte vorzuhalten, die demnächst vom Server physikalisch gelöscht werden sollen oder dort bereits gelöscht worden sind.

Die vorliegende Dissertation stellt zunächst verschiedene drahtlose Netzwerktypen vor, welche derzeit zur mobilen Datenkommunikation zur Verfügung stehen, und zeigt anschließend, daß die Mehrheit heutiger drahtloser Netze über eine Asymmetrie in der Bandbreitenkapazität, dem Datenvolumen sowie der Servicelast verfügt. Es wird aufgezeigt, daß die hybride Datenkommunikation, welche die traditionelle Pull- und die relativ neue Pushtechnik vereint, eine attraktive Kommunikationsvariante zur Schaffung skalierbarer und flexibler mobiler Datendienste darstellt. Es folgt ein kurzer Überblick über die verschiedenen umweltbezogenen und systemimmanenten Einschränkungen, welchen mobile Computersysteme ausgesetzt sind, und wir schlussfolgern daraufhin, daß es in mobilen drahtlosen Umgebungen wesentlich schwieriger als in traditionellen stationären Datenübertragungsnetzen ist, gute Performanceergebnisse in Verbindung mit starken semantischen Konsistenzgarantien für Transaktionen zu erreichen. Im gleichen Zuge werden mögliche Techniken präsentiert, um Datenkonflikte zwischen parallel laufenden Transaktionen zu vermeiden bzw. deren Anzahl zu verringern, und es werden Möglichkeiten vorgestellt, wie Datenkonflikte erkannt und aufgelöst werden können. Wenn man hybride Datenkommunikationsnetze mit hoher Performance, Skalierbarkeit und Verlässlichkeit entwerfen und realisieren möchte und darüber hinaus auch noch strenge Anforderungen an die Datenkonsistenz und -aktualität des Systems stellt, müssen — neben der Transaktionskontrolle — diverse andere performance- und missionskritische Aspekte betrachtet werden. Um dieser Forderung nachzukommen, beschäftigt sich die Arbeit u.a. mit den Themen des Broadcast-Schedulings und der Broadcast-Indexierung, und es werden hierzu in der Literatur vorgeschlagene Ansätze präsentiert sowie evaluiert.

Im Anschluß an die Darlegung der praktischen Notwendigkeit und des zunehmenden Interesses an der Gewährleistung einer zeitnahen und konsistenten Bereitstellung von Masseninformation über mobile Breitband-Broadcastkanäle schließt sich eine Diskussion über die Herausforderungen und vielfältigen Probleme an, welche hiermit verbunden sind. In diesem Zusammenhang wird behauptet, daß die momentan vorhandenen Definitionen von Isolationsgraden nicht für die Implementation von Transaktionsprotokollen geeignet sind, welche für Nur-Lese-Transaktionen kreiert werden, da diese eventuell ungewollte — obwohl korrekte — Datenzugriffe aufgrund nicht vorhandener Datenaktualitätsgarantien erlauben. Um diesem Problem Abhilfe zu schaffen, werden vier neue Isolationsgrade definiert, welche zahlreiche nützliche Datenkonsistenz- und -aktualitätsgarantien für Nur-Lese-Transaktionen zur Verfügung stellen, und es werden geeignete Implementierungen dieser Isolationsgrade für hybride Datenkommunikationsnetze präsentiert. Um Performanceunterschiede zwischen den neu definierten Isolationsgraden bzw. Protokollen zu ermitteln, wurden zahlreiche empirische Experimente durchgeführt, welche zeigen, daß der Strict-Forward-BOT-View-Consistency-Isolationsgrad und seine Implementation, welche die Bezeichnung MVCC-SFBVC trägt, die besten Performanceergebnisse unter den verglichenen Transaktionsprotokollen erzielen.

Um die Antwortzeiten von mobilen Anwendungen zu verkürzen und um eine hohe Skalierbarkeit von hybriden Datenkommunikationssystemen zu erreichen, spielt die Pufferverwaltung der mobilen Klienten (d.h. Endgeräte) eine wesentliche, wenn nicht die entscheidende Rolle. Da existierende Pufferverwaltungsstrategien nur eine ungenügende Unterstützung für Mehrversions-Transaktionsprotokolle bieten, stellt diese Arbeit eine neue Pufferersetzungs- und -vorabrufstrategie vor, welche den Namen MICP trägt. Das Akronym MICP steht dabei für Multi-version Integrated Caching and Prefetching; es bezeichnet eine hybride Pufferverwaltungsmethode, welche sowohl Datenseiten als auch Datenobjekte verwalten kann. Während Datenseiten nach dem traditionellen LRU-Verfahren ersetzt werden, führt das MICP-Verfahren Objektersetzungs- und -vorabrufentscheidungen auf der Basis zahlreicher performance-kritischer Informationen durch, wozu u.a. die Aktualität und Häufigkeit vorangegangener Objektzugriffe, die prognostizierte Änderungswahrscheinlichkeit der gespeicherten Datenobjekte sowie deren Wiederbeschaffungskosten zählen. Um auf bestimmte speicherungsrelevante Ereignisse reagieren zu können, wie z.B., daß bestimmte gespeicherte Objektversionen für die Ausführung der momentan laufenden Transaktion(en) nutzlos geworden sind, ist der MICP-Puffermanager eng an den Transaktionsmanager gekoppelt. Um zu vermeiden, daß nützliche — jedoch nicht-wiederbeschaffbare — Objektversionen mit wiederbeschaffbaren Objektversionen um verfügbare Pufferressourcen konkurrieren müssen, teilt das MICP-Verfahren den vorhandenen Speicherplatz des Klientenpuffers in zwei unterschiedlich große Segmente auf: die sogenannte REC- und NON-REC-Partition. Zur Beurteilung der Effizienz der MICP-Puffermanagementstrategie wurden umfangreiche Simulationsstudien durchgeführt, welche zeigen, daß mobile Klienten, die nicht das MICP-Verfahren zur Ausführung von Nur-Lese-Transaktionen einsetzen, einen durchschnittlichen Performanceverlust von etwa 19% erleiden.

Schließlich widmet sich die Arbeit dem Problem, Serialisierbarkeit von Lese-Schreib-Transaktionen in Verbindung mit guten Antwortzeiten und einer niedrigen Transaktionsabbruchsrate in hybriden drahtlosen Datenkommunikationsnetzen zu erreichen. Um diese Ziele zu verwirklichen, stellt die Arbeit eine Familie von fünf Mehrversions-Transaktionsprotokollen vor, die den Namen MVCC-* trägt. Die einzelnen Protokolle der MVCC-*-Familie unterscheiden sich dabei hinsichtlich der Scheduling-Performance, der Datenaktualitätsgarantien, welche den Leseoperationen der Transaktionen zugesichert werden, sowie der Speicher- und Zeitkomplexität. Es werden die Performanceabweichungen zwischen den einzelnen Protokollen der MVCC-*-Familie quantifiziert, welche aufgrund unterschiedlicher Datenaktualitätsgarantien und Schedulingentscheidungen entstehen, und außerdem werden die Performanceergebnisse mit denen verglichen, welche für das bekannte Snapshot-Isolation-Protokoll entstehen. Da die MVCC-*-Protokollfamilie für Schedulingentscheidungen nur einfache Lese- und Schreiboperationen zur Laufzeit und keine semantischen Informationen über ihre zugrunde liegenden Transaktionen verwendet, skizziert und evaluiert die Arbeit diverse Möglichkeiten, welche die vorgeschlagenen Transaktionsprotokolle erweitern, um Datenkonflikte zu vermeiden bzw. zu reduzieren. Hierzu gehört u.a. die Spezifikation von alternativen Schreiboperationen für ursprüngliche Änderungsoperationen, die Reduktion der Datengranularität, auf welcher die Transaktionskontrolle basiert, sowie die Erhöhung der Anzahl der vom System vorgehaltenen Versionen der Datenobjekte. Anschließend wird verdeutlicht, daß die MICP-Pufferverwaltungsstrategie auch dann dem LRFU-Verfahren bezüglich der Pufferperformance überlegen sein kann, wenn diese in Verbindung mit der Ausführung von Lese-Schreib-Transaktionen eingesetzt wird.

“It is with words as with sunbeams — the more they are condensed, the deeper they burn.”

– Robert Southey

Abstract

Hybrid, i.e., push- and pull-based, data delivery is likely to become a method of choice for the distribution of information to a large user population in many new mobile and stationary applications.

One of the major issues in hybrid data delivery networks is to provide clients with a consistent and current view of the data delivered by the server through a broadband broadcast channel and a set of dedicated narrowband unicast channels, while minimizing the users’ response times. Another, complementary problem in hybrid data delivery is the caching policy of the clients, which needs to resolve the mismatch between the server’s broadcast schedule and the individual user’s access pattern, to compensate for the sequential access characteristics of the air-cache, and to function as a last-resort source for non-current object versions that have been physically evicted from the server’s storage facilities.

In this doctoral thesis, we first discuss the various wireless network types currently available to provide data communication services and show that the majority of them exhibit asymmetry in bandwidth capacity (i.e., a significantly higher bandwidth is available from servers to clients than in the reverse direction), data volume, and service load. We then argue that hybrid data delivery, which integrates the traditional pull and the rather novel push techniques, is an attractive communication mode to create highly scalable and flexible data services. A brief overview of the environmental and system-immanent constraints of mobile computing systems follows, and we reason that providing good system throughput results along with strong semantic guarantees for transactions is more challenging in mobile, portable environments than in conventional fixed networks. In the same vein, we present possible techniques to avoid and reduce the number of data conflicts that may arise and discuss ways to detect and, more importantly, to resolve them once identified. To design and deploy high-performance, scalable, and reliable hybrid delivery networks that provide strong semantic guarantees w.r.t. data consistency and currency to their users, various other performance- and mission-critical issues besides concurrency control (CC) need to be addressed. To take account of that, we present and evaluate several strategies proposed in the literature on major topics, such as broadcast scheduling and broadcast channel indexing, that are not covered in separate chapters in later parts of the thesis.

After motivating the practical need for timely and consistent data delivery to thousands of information consumers, we discuss the challenges and problems involved in providing appropriate consistency and currency guarantees to dissemination-based applications. We then argue that current definitions of isolation levels (ILs) are inappropriate for implementations of CC protocols suitable for read-only transactions, as they allow unwanted, though consistent, data access patterns due to a lack of data currency guarantees. To rectify the problem, we define four new ILs providing various useful data consistency and currency guarantees to read-only transactions and present suitable implementations of the proposed ILs for hybrid data delivery networks. To evaluate the performance trade-offs among the newly defined ILs and protocols, respectively, extensive numerical experiments are conducted, demonstrating that the Strict Forward BOT View Consistency level and its implementation, termed MVCC-SFBVC, provide the best performance results among the CC protocols studied.
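The BOT-oriented ILs above build on the standard multi-version read rule: for each object, a read-only transaction reads the newest version committed no later than its begin-of-transaction (BOT) timestamp. A minimal sketch of that rule follows; the function name and version layout are illustrative and do not reproduce the MVCC-BS or MVCC-SFBVC implementations of Chapter 4.

```python
from bisect import bisect_right

def read_bot_version(versions, bot_ts):
    """Map a logical read to the newest object version whose commit
    timestamp does not exceed the reader's BOT timestamp."""
    # versions: list of (commit_ts, value), sorted by ascending commit_ts.
    commit_ts_list = [ts for ts, _ in versions]
    pos = bisect_right(commit_ts_list, bot_ts)
    if pos == 0:
        raise LookupError("no version committed at or before BOT")
    return versions[pos - 1][1]

history = [(5, "x1"), (12, "x2"), (20, "x3")]
assert read_bot_version(history, 15) == "x2"  # version committed at 20 is invisible
```

The "strict forward" variants relax exactly this rule: under certain conditions a reader may advance to versions committed after its BOT, which is what gives them their currency advantage.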

To shorten response times and to achieve high scalability, client caching is one of the most fundamental techniques, if not the most fundamental one. In this thesis, we introduce a novel client cache replacement and prefetching strategy, called MICP, that makes eviction and prefetching decisions sensitive to various performance-critical factors, including the objects’ access recency and frequency in the recent past, their update likelihood in the near future, and their re-acquisition costs.

On top of that, MICP is tightly coupled to the transaction manager in order to obtain, and instantly react to, information indicating that locally stored non-current object versions have become useless from the CC perspective and can therefore be evicted from the client cache. To prevent useful, but non-re-cacheable, object versions from competing with re-cacheable object versions for available cache resources, MICP logically divides the cache into two variable-sized segments, dubbed REC and NON-REC. To evaluate MICP’s cache management efficiency, we report on extensive experiments showing that MICP is able to improve the transaction throughput achieved by state-of-the-art online cache replacement and prefetching policies by about 19% when used for executing read-only transactions.
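To give a flavour of the cost-based eviction decision described above, the toy policy below combines the same factors (recency/frequency, update likelihood, re-acquisition cost) into a single retention score and evicts from the re-cacheable pool first. This is a simplified illustration under assumed inputs, not MICP's actual PCC/PCP algorithms of Chapter 5.

```python
def retention_score(recency_freq, update_prob, reacq_cost):
    """Keep objects that are hot, unlikely to be updated soon,
    and expensive to re-fetch from the air-cache (high score)."""
    return recency_freq * (1.0 - update_prob) * reacq_cost

def pick_victim(cache, non_recacheable):
    """cache maps object id -> (recency_freq, update_prob, reacq_cost).
    Prefer evicting from the re-cacheable (REC) pool, because a
    dropped non-re-cacheable (NON-REC) version is lost for good."""
    rec_pool = {oid: f for oid, f in cache.items()
                if oid not in non_recacheable}
    pool = rec_pool or cache  # fall back if only NON-REC versions remain
    return min(pool, key=lambda oid: retention_score(*pool[oid]))

cache = {"a": (0.9, 0.1, 5.0), "b": (0.2, 0.2, 1.0), "c": (0.5, 0.5, 2.0)}
assert pick_victim(cache, non_recacheable={"b"}) == "c"  # "b" is protected
```

Without the NON-REC protection the lowest-scoring object "b" would be chosen; shielding non-re-cacheable versions is exactly the rationale for MICP's two-segment cache.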

Finally, we consider the challenging problem of providing serializability along with good performance and strong semantic data currency and consistency guarantees to mobile applications issuing read-write transactions. To achieve this goal, we present a suite of five multi-version concurrency control (MVCC) protocols, denoted MVCC-*, that differ from each other in terms of their scheduling performance, data currency guarantees, and space and time complexity. We quantify the performance deviations among the protocols of the MVCC-* suite due to applying different read rules for servicing read requests and, additionally, compare their results with those measured for the well-known Snapshot Isolation scheme. As the MVCC-* suite is based only on analyzing read and write operations at runtime and does not exploit semantic information about their constituent transactions, we outline and partially evaluate (by means of simulation) possibilities of extending the proposed protocols with conflict-reducing and conflict-avoiding measures, such as specifying alternative write operations, reducing the data granularity at which CC is applied, and increasing the number of versions managed in the system. Last, but not least, we provide evidence that MICP is superior to LRFU, the best cache replacement policy known so far, when used to improve the response times of read-write transactions.

“If you wish your merit to be known, acknowledge that of other people.”

– Oriental Proverb

Acknowledgments

First of all, I’m indebted to my dissertation supervisor Marc H. Scholl, who — using patience, motivation, ingenuity, and a laissez-faire management style — managed to get me through to my PhD. Thanks go to my parents, Christine and Frank, and my sister Bianca for their unflagging support throughout the many years of my academic work and for their warm and cosy welcome whenever I visited them during holidays or on other occasions. My thanks go also to my current and ex-flatmates Denise, Lisa, Stefan, and Tilman, who made living and spending leisure time in Konstanz a pleasure. Finally, thanks to the other members of the Database Research Group and to the anonymous referees of the research papers that paved the way to this dissertation for their insightful, critical, and valuable comments on my work. A special thanks goes to my PhD colleague Svetlana for proof-reading large parts of the dissertation manuscript.

Contents

Zusammenfassung iii

Abstract vii

Acknowledgments xi

List of Figures xix

List of Tables xx

List of Algorithms xxi

List of Acronyms xxii

List of Symbols xxviii

1 Introduction 1

1.1 Problem Statement ...... 1

1.2 Contribution ...... 10

1.3 Publications ...... 13

1.4 Outline ...... 14

2 Background 17

2.1 Basics of Wireless Communication Systems ...... 18

2.2 Wireless Network Types ...... 20


2.3 Limitations of Mobile Computing ...... 30

2.3.1 Techniques to Avoid or Reduce Reconciliation Conflicts ...... 32

2.3.2 Techniques to Detect and Resolve Reconciliation Conflicts ...... 35

3 Hybrid Data Delivery 37

3.1 Why to Use Hybrid Data Delivery ...... 37

3.2 Hybrid Data Delivery Networks ...... 39

3.2.1 Organizing the Broadcast Program ...... 41

3.2.2 Indexing the Broadcast Program ...... 47

4 Processing Read-only Transactions Efficiently and Correctly 61

4.1 Introduction ...... 61

4.1.1 Motivation ...... 62

4.1.2 Contribution and Outline ...... 64

4.2 Preliminaries ...... 64

4.3 New Isolation Levels Suitable for Read-only Transactions ...... 70

4.3.1 Why Serializability may be Insufficient ...... 70

4.3.2 BOT Serializability ...... 71

4.3.3 Strict Forward BOT Serializability ...... 76

4.3.4 Update Serializability ...... 82

4.3.5 Strict Forward BOT Update Serializability ...... 83

4.3.6 View Consistency ...... 89

4.4 Implementation Issues ...... 94

4.4.1 MVCC-BS ...... 97

4.4.2 MVCC-SFBS ...... 101

4.4.3 MVCC-SFBUS ...... 101

4.4.4 MVCC-SFBVC ...... 103

4.5 Performance Results ...... 104

4.5.1 System Model ...... 104

4.5.2 Workload Model ...... 108

4.5.3 Experimental Results of the Proposed CC Protocols ...... 110

4.5.4 Comparison to Existing CC Protocols ...... 114

4.6 Conclusion and Summary ...... 119

5 Client Caching and Prefetching Strategies to Accelerate Read-only Transactions 121

5.1 Introduction and Motivation ...... 121

5.1.1 Multi-Version Client Caching ...... 122

5.1.2 Multi-Version Client Prefetching ...... 124

5.1.3 Outline ...... 125

5.2 System Design and General Assumptions ...... 125

5.2.1 Data Delivery Model ...... 126

5.2.2 Client and Server Cache Management ...... 127

5.2.2.1 Data Versioning ...... 129

5.2.2.2 Client Cache Synchronization ...... 129

5.3 MICP: A New Multi-Version Integrated Caching and Prefetching Algorithm .... 130

5.3.1 PCC: A Probabilistic Cost-based Caching Algorithm ...... 130

5.3.2 PCP: A Probabilistic Cost-based Prefetching Algorithm ...... 135

5.3.3 Maintaining Historical Reference Information ...... 138

5.3.4 Implementation and Performance Issues ...... 141

5.4 Performance Evaluation ...... 142

5.4.1 System Model ...... 142

5.4.2 Workload Model ...... 145

5.4.3 Other Replacement Policies Studied ...... 147

5.4.4 Basic Experimental Results ...... 149

5.4.5 Additional Experiments ...... 150

5.4.5.1 Effects of the Version Management Policy on MICP-L ...... 152

5.4.5.2 Effects of the History Size on MICP-L ...... 153

5.5 Conclusion ...... 155

6 Processing Read-Write Transactions Efficiently and Correctly 157

6.1 Introduction ...... 157

6.1.1 Motivation ...... 158

6.1.2 Contribution and Outline ...... 159

6.2 System Design and Assumptions ...... 160

6.2.1 Data Delivery Model ...... 160

6.2.2 Database and Transaction Model ...... 163

6.3 A New Suite of MVCC Protocols ...... 165

6.3.1 MVCC-BOT Scheme ...... 165

6.3.2 Optimizing the MVCC-BOT Scheme ...... 175

6.3.3 MVCC-IBOT Scheme ...... 183

6.3.4 Optimizing the MVCC-IBOT Scheme ...... 195

6.3.5 MVCC-EOT Scheme ...... 204

6.4 Performance-related Issues ...... 208

6.4.1 Caching ...... 208

6.4.2 Disconnections ...... 211

6.4.3 Conflict Reducing Techniques ...... 213

6.5 Performance Evaluation ...... 218

6.5.1 Simulator Model ...... 219

6.5.2 Workload Model ...... 223

6.5.3 Comparison with other CC Protocols ...... 223

6.5.4 Basic Experimental Results ...... 225

6.5.5 Results of the Sensitivity Analysis ...... 226

6.5.5.1 Effects of Varying the Data Contention Level ...... 227

6.5.5.2 Effects of Specifying Alternative Write Operations ...... 228

6.5.5.3 Effects of Intermittent Connectivity ...... 229

6.5.5.4 Effects of Using Various Caching and Prefetching Policies .... 232

7 Conclusion and Future Work 237

7.1 Summary and Conclusion ...... 237

7.2 Future Work ...... 242

Bibliography 247

List of Figures

2.1 Demand-assign based multiple access techniques ...... 19

3.1 Various possible organization structures of the broadcast program...... 48

3.2 Example illustrating the signature comparison process...... 50

3.3 An example illustrating the Hashing A data access protocol ...... 54

3.4 Tree-indexed ...... 56

4.1 Multi-version serialization graph of MVH1...... 71

4.2 Multi-version serialization graph of MVH3...... 85

4.3 Organization structure of the broadcast program...... 96

4.4 An overview of the simulation model used to generate the performance statistics. . 111

4.5 Throughput achieved by the protocols implementing the newly defined ILs ..... 113

4.6 Wasted work performed by the protocols implementing the newly defined ILs ... 114

4.7 Protocols studied with their respective data consistency and currency guarantees. . 116

4.8 Throughout results of various CC protocols compared to MVCC-SFBVC ..... 117

4.9 Wasted work performed by various CC protocols compared to MVCC-SFBVC .. 118

5.1 An example illustrating the peculiarities of multi-version client caching...... 124

5.2 Organization of the client cache...... 128

5.3 Performance of MICP-L and its competitors under various transaction sizes .... 151

5.4 Client cache hit rate of MICP-L and its competitors under various transaction sizes 152

5.5 Performance of MICP-L and its competitors under various versioning strategies .. 153

5.6 Performance of MICP-L under various cache sizes when HCR is varied...... 154


6.1 Structure of the broadcast program...... 162

6.2 Multi-version serialization graph of MVH4 and MVH5...... 167

6.3 Multi-version serialization graph of MVH6...... 176

6.4 Multi-version serializability graph of MVH8...... 195

6.5 Multi-version serialization graph of MVH9...... 212

6.6 Two level history showing lower and higher order operations of two transactions .. 217

6.7 Performance results of the MVCC-* suite and SI under various transaction sizes .. 226

6.8 Performance of MVCC-BOT and MVCC-IBOT and their optimized variants .... 227

6.9 Performance of various CC protocols by varying the number of data updates .... 228

6.10 Performance gain by providing alternative write operations ...... 229

6.11 Performance degradation when increasing the client disconnection probability – I . 230

6.12 Performance degradation when increasing the client disconnection probability – II . 231

6.13 Performance deviation between various client caching and prefetching policies – I . 233

6.14 Performance deviation between various client caching and prefetching policies – II 233

List of Tables

2.1 Various characteristics of current and emerging wireless network technologies. .. 28

4.1 Newly defined ILs and their core characteristics...... 95

4.2 Summary of the system parameter settings – I (Read-only transaction experiments) 107

4.3 Summary of the system parameter settings – II (Read-only transaction experiments) 109

4.4 Summary of the workload parameter settings (Read-only transaction experiments) . 111

5.1 Summary of the system parameter settings – I ...... 144

5.2 Summary of the system parameter settings – II ...... 146

5.3 Summary of the workload parameter settings ...... 147

6.1 Definitions of possible conflicts between transactions...... 164

6.2 Summary of the system parameter settings – I (Read-write transaction experiments) 220

6.3 Summary of the system parameter settings – II (Read-write transaction experiments) 222

6.4 Summary of the workload parameter settings (Read-write transaction experiments) 224

6.5 The MVCC-* suite at a glance...... 236

List of Algorithms

3.1 Multi-disk broadcast generation algorithm ...... 45

3.2 Access protocol for retrieving data objects by using the integrated signature scheme. 51

3.3 Data access protocol of the Hashing A scheme ...... 53

3.4 Access protocol for retrieving data objects by using the (1,m) indexing scheme ... 57

4.1 Algorithm used by MVCC-SFBS to map read operations to object version reads .. 101

4.2 Algorithm used by MVCC-SFBUS to map read operations to object version reads . 103

5.1 Probabilistic Cost-based Caching (PCC) Algorithm ...... 136

5.1 Probabilistic Cost-based Caching (PCC) Algorithm (cont’d) ...... 137

5.2 Probabilistic Cost-based Prefetching (PCP) Algorithm ...... 139

6.1 MVCC-BOT’s scheduling algorithm ...... 168

6.2 CCR processing and transaction validation under MVCC-BOT ...... 168

6.3 CCR processing and transaction validation under MVCC-BOTO ...... 177

6.4 MVCC-BOTO’s scheduling algorithm ...... 178

6.5 CCR processing and transaction validation under MVCC-IBOT ...... 185

6.6 MVCC-IBOT’s scheduling algorithm ...... 186

6.7 CCR processing and transaction validation under MVCC-IBOTO ...... 196

6.8 MVCC-IBOTO’s scheduling algorithm ...... 197

6.9 CCR processing and transaction validation under MVCC-EOT ...... 205

6.10 MVCC-EOT’s scheduling algorithm ...... 206

List of Acronyms

2G Second Generation

2.5G Second-and-a-half Generation

3G Third Generation

4G Fourth Generation

ACL Asynchronous Connection-oriented

ADAT Air-Cache Data Access Time

ALOHA The earliest packet network, developed at the University of Hawaii at Manoa

AIPT Air-Cache Index Probe Time

ATT Air-Cache Tuning Time

AWT Air-Cache Wait Time

BID Bucket Identifier

BOT Begin-Of-Transaction

CC Concurrency Control

CCR Concurrency Control Report

CCSize Client Cache Size

CD-SFR-SQ-MVSG Causal Dependency Strict Forward Read Single Query Multi-Version Serialization Graph

CDMA Code Division Multiple Access

CPU Central Processing Unit

CRF Combined Recency and Frequency Value

CRM Customer Relationship Management

xxii xxiii

CSMA Carrier Sense Multiple Access

DBSize Database Size

DBS Direct Broadcast Satellite

EDGE Enhanced Data Rates for GSM Evolution

EOT End-Of-Transaction

ERP Enterprise Resource Planning

FCV Forward Consistent View

FDD Frequency Division Duplexing

FDMA Frequency Division Multiple Access

FIFO First-In-First-Out

FIR Fast Infrared

GEO Geostationary Orbit

GPRS General Packet Radio Service

GPS Global Positioning System

GSM Global System for Mobile Communications

HCR History Size / Cache Size Ratio

HSCSD High Speed Circuit Switched Data

IBOT In-Between-Of-Transaction

ID Identifier

IEEE Institute of Electrical and Electronics Engineers

IL Isolation Level

IR Infrared

IrDA Infrared Data Association

ISDN Integrated Services Digital Network

IS Index Segment

IS-136 Interim Standard 136

IS-95 Interim Standard 95

IS-95B Second Generation of the IS-95 Standard

ISM Industrial, Scientific and Medical

KIWI Kill It With Iron

LEO Low-altitude Earth Orbits

LFU Least Frequently Used

LOS Line-of-Sight

LRFU Least Recently Frequently Used

LRFU-P Prefetch-based Variant of the LRFU Algorithm

LRU Least Recently Used

MAC Media Access Control

MBC Major Broadcast Cycle

MIBC Minor Broadcast Cycle

MIPS Million Instructions Per Second

MICP Multi-Version Integrated Caching and Prefetching Algorithm

MICP-L Lightweight Multi-Version Integrated Caching and Prefetching Algorithm

MEO Medium-altitude Earth Orbits

MVSG Multi-Version Serialization Graph

MVCC Multi-Version Concurrency Control

MVCC-BOT Multi-Version Concurrency Control Protocol with BOT data currency guarantees

MVCC-BOTO Optimized Multi-Version Concurrency Control Protocol with BOT data currency guarantees

MVCC-BS Multi-Version Concurrency Control Protocol with BOT Serializability Guarantees

MVCC-EOT Multi-Version Concurrency Control Protocol with EOT data currency guarantees

MVCC-IBOT Multi-Version Concurrency Control Protocol with IBOT data currency guarantees

MVCC-IBOTO Optimized Multi-Version Concurrency Control Protocol with IBOT data currency guarantees

MVCC-SFBS Multi-Version Concurrency Control Protocol with Strict Forward BOT Serializability Guarantees

MVCC-SFBUS Multi-Version Concurrency Control Protocol with Strict Forward BOT Update Serializability Guarantees

MVCC-SFBVC Multi-Version Concurrency Control Protocol with Strict Forward BOT View Consistency Guarantees

MIR Medium Infrared

MMDS Multi-channel Multi-point Distribution Service

MMS Multi-Media Messaging

MOB Modified Object Buffer

NLOS Non-Line-of-Sight

OBSize Object Size

OCC Optimistic Concurrency Control

OCSize Client Object Cache Size

OID Object Identifier

OIL Object Invalidation List

OFDM Orthogonal Frequency Division Multiplexing

OVRPL Object Version Read Prohibition List

OVWPL Object Version Write Prohibition List

P Probability-based Cache Replacement Algorithm

P-P Prefetch-based Variant of the P Algorithm

PDC Pacific Digital Cellular

PGSize Page Size

PLE Prohibition List Entry


PSTN Public Switched Telephone Network

QoS Quality of Service

RC Read Consistency

RF Radio Frequency

RFP Read Forward Phase

RFF Read Forward Flag

RFSTS Read Forward Stop Timestamp

RLC Radio Link Control

RPP Relative Performance Penalty

RTT Round-Trip Time

SBSize Server Buffer Size

SCO Synchronous Connection-oriented

SI Snapshot Isolation

SIR Serial Infrared

SMS Short Message Service

SPOT Smart Personal Objects Technology

ST-MVSG Start Time Multi-Version Serialization Graph

SFR-MVSG Strict Forward Read Multi-Version Serialization Graph

SFR-SQ-MVSG Strict Forward Read Single Query Multi-Version Serialization Graph

STS Start Timestamp

TID Transaction Identifier

TDD Time Division Duplexing

TDMA Time Division Multiple Access

TOB Temporary Object Buffer

US Update Serializability

VFIR Very Fast Infrared

W2R Cache Replacement and Prefetching Algorithm

W2R-B Broadcast-based Variant of the W2R Algorithm

WPAN Wireless Personal Area Network

WLAN Wireless Local Area Network

WMAN Wireless Metropolitan Area Network

WWAN Wireless Wide Area Network

List of Symbols

BDi Broadcast disk i

Bcurr Bucket currently being broadcast

CPre,i Pre-condition of transaction Ti

CPost,i Post-condition of transaction Ti

Chit Average client cache hit rate

CSi Client object cache segment i

CT(Ti) Set of transactions that conflict with Ti

CTS(x) Commit timestamp of data object or data page x

CTSOIL(x) Commit timestamp of object x maintained in the object invalidation list

CRFn(x) Combined recency and frequency value of object x over the last n references

D Database

DS Database state

DSRFSTS(Ti) Database state as it existed at RFSTS(Ti)

DSEOT(Ti) Database state as it existed at EOT(Ti)

E Edge of a multi-version serialization graph

FVM(Ti) Final validation message of transaction Ti

IDCCR,c Identifier of the current CCR

IDCCR,l Identifier of the latest CCR received by the client

IDref,c Reference identifier associated with the current object reference

IDref,l(x) Reference identifier assigned to object x when it was last accessed

IDupdate,c Update identifier associated with the current object update


IDupdate,l(x) Update identifier assigned to object x when it was last accessed

ISi, j j-th index segment of data channel i

LMBC Average MBC length

N Node of a multi-version serialization graph

Nver(xi) Number of versions of object x with CTSs equal to or older than xi currently being kept by the server

Nop, j Number of operations executed so far by transaction Tj

Nobj Number of objects for which the client retains historical reference information

PT(Ti) Set of transactions that precede Ti in any valid serialization order

ReadSet(Ti) Read set of the read-only or read-write transaction Ti

RCt (xi) Re-acquisition costs of object version xi at time t

RFF(Ti) Read forward flag of transaction Ti

RFSTS(Ti) Read forward stop timestamp of transaction Ti

Sbcast Signature file containing the signatures of all objects scheduled for broadcasting

Si Signature of object i

Si, j j-th atomic statement of transaction Ti

Smem Client memory size

Squery Query signature

ST(Ti) Set of transactions that succeed Ti in any valid serialization order

STS(Ti) Start timestamp of transaction Ti

Tactive(Ti) Set of read-write transactions that were active during Ti's execution time, but committed before Ti

T(xi) Estimated weighted time to service a fetch request for object version xi

Ti Read-only or read-write transaction with the index i

Ti^data Time when the i-th data bucket is periodically broadcast in the broadcast cycle

Ti,j^index Time at which the index bucket at position p(i, j) in the broadcast cycle is periodically broadcast

Thit (xi) Amount of time it takes to service a request for object version xi

Tmiss(xi) Weighted approximation of the amount of time it takes the client to restore the current state of some active transaction Tj in case Tj has to be aborted due to a fetch miss of xi

Tre-exe,j Estimated time it takes to restore the current state of an active read-only transaction Tj in case Tj has to be aborted due to a fetch miss

Ts Time at which the server begins broadcasting

Tupdate Set of read-write transactions

TSC(CCRlast) Timestamp of the last CCR that has been successfully processed by client C

UPn(x) Update probability of object x based on its n previous updates

WriteSet(Ti) Write set of read-write transaction Ti

α Aging factor

bucket id(B) Function returning the BID of bucket B

dindex Number of times the index segment is broadcast per MBC

dom(xi) Domain of data object xi

f Version function

h Hash function

h(k) Hash function returning the hash value of key k

height Height of the index tree

ji j-th index bucket at the i-th index level of the index tree

k Hash key

lCCR Length of the CCR segment

ldata Length of the data segment

lindex Length of the index segment

lobject Length of an object in terms of broadcast ticks

level(i) i-th level of the index tree

nbcast Number of data, index, and CCR buckets contained in one broadcast cycle

ndata Number of data buckets required to accommodate all data objects scheduled for broadcasting

nnon−index Number of non-index buckets disseminated between two consecutive index files

ni Number of index buckets at the i-th level of the index tree

nCCR Number of CCR buckets reserved for CC-related information recorded during the last MIBC

oMBC Offset to the next MBC

pi Probability of accessing the i-th object of the database

p(i, j) Position of the j-th index bucket of the i-th index level in the broadcast cycle

ri[x] Read operation on data object x by transaction Ti

ri[x,v] Read operation on data object x by transaction Ti; the value read from x is v

s Bucket shift value

scurr Shift value of the bucket currently being broadcast

w Number of previous MBCs for which the broadcast server maintains a data update history

wi[x] Write operation on data object x by transaction Ti

wi[x,v] Write operation on data object x by transaction Ti; the value written into x is v

“A new scientific truth does not triumph by convincing its opponents and making them see the light, but rather because its opponents eventually die, and a new generation grows up that is familiar with it.”

– Max Planck

Chapter 1

Introduction

This chapter briefly enumerates the reasons why we are concerned about providing well-defined data consistency and currency guarantees to read-only and read-write transactions along with good performance within the context of hybrid data delivery networks, lists the major contributions of the PhD thesis, and presents the dissertation’s outline.

1.1 Problem Statement

Technological advances in the computer, software, and communications industries over the last two decades have given users the possibility to access, produce, and update information without the constraint of being connected to a fixed (wireline) network. Nowadays, mobile users experience instant access to various information-dissemination and business-enabling services nearly anytime and anywhere on the globe. The rapid evolution of mobile computing is fostered by the following trends and already existing or forthcoming applications: (a) Due to higher global industrial and agricultural productivity along with falling average annual working hours [56], people's leisure time has increased and is partially filled by the intensive use of wireless infotainment applications such as the Short Message Service (SMS), Multimedia Messaging (MMS), wireless games, etc.

(b) In today's financial and economic world, a lot of business happens outside the corporate walls at trade fairs, exhibitions, event centers, customers' premises, and the like. On such occasions, both in preparation for and during business meetings, remote wireless access to corporate and non-corporate data via mobile handheld devices is highly desirable and useful. (c) Besides the widening appeal of mobile computing for industry, a new class of applications, called location-based services

(LBS), attracts the attention of the public and will certainly contribute to the excitement and flourishing of mobile computing in the years to come. Service examples promoting the use of LBSs are finding a hassle-free route through jammed roads, booking a hotel with room rates lower than

$100 within a 10 km radius of the current position, or getting the closest parking space available. As location-based information (free rooms and parking lots in a city, travel information and devices for a region, etc.) is typically of interest to a larger user population, data dissemination technology is often applied in order to deliver the information to the consumers. (d) Besides the application areas named above, mobile computing will be promoted by the migration of conventional application domains onto the wireless platform. Prominent examples include enterprise applications such as

Enterprise Resource Planning (ERP), Sales Force Automation, or Customer Relationship Management (CRM), financial services like online banking and stock trading, and e-commerce systems, to name just a few.

This new, fast-evolving computing and communications environment presents several challenges to the deployment of mobile data services, such as bursty and unpredictable workloads [51, 97], data volume asymmetry, service load asymmetry, network bandwidth asymmetry [3], etc. To address these challenges, different channel operators (e.g., Hughes Network [70] or StarBand [79]) have started providing new types of information services, namely broadcast or dissemination-based services, which have the potential to adapt and scale to bursty and unexpected user demands, and to exploit the asymmetric bandwidth property prevalent in many mobile networks (see Section 2.2 for more details). In particular, the recent announcement of the Smart

Personal Objects Technology (SPOT) by Microsoft [108] once more highlights the industrial interest in and the feasibility of utilizing data broadcasting for wireless data services. If dissemination-based database systems are to become as widespread as traditional distributed database systems, they have to provide the same level of database support as traditional OLTP systems. Thus, an interesting challenge, and at the same time the main theme of this thesis, is to study the issue of providing transaction support in the framework of a dissemination-based database system [74, 122].

Traditional databases use transactions to ensure that programs transform the system from one consistent state to another in spite of concurrency and failures. The most common correctness criterion adopted by traditional transactions is based on the notion of serializability, which guarantees that even though transactions run concurrently, it appears to the user as if they executed in serial order.

We believe that in mobile data broadcasting environments, even though transaction processing may be hindered by the various limitations inherent in mobile systems, such as frequent disconnections, limited battery life, low-bandwidth communication (at least in the direction from the client to the server), reduced storage capacity, etc., transactions that modify the database (i.e., read-write transactions) should, in general, not execute at consistency levels below serializability. Even though consistency levels weaker than serializability have the potential to support transactions more efficiently, they have the important drawback that the application programmer needs to be fully aware of the conflicts of his or her transactions with other transactions in the system, and needs to specify compatibility sets [49, 52] (which group together transactions that can freely interleave), (state-independent or state-dependent) commutativity tables [111, 160], or pre- and postconditions of the atomic operation steps of the constituent transactions [11, 25] in order to ensure that data consistency is never violated. As information such as the commutativity of operations or interstep assertions cannot be determined automatically and is all but trivial to specify, semantics-based CC complicates application programming greatly and is inherently error-prone. Thus, we initially chose to abstract from the semantic details of the operations of each transaction and to concentrate only on the sequence of read and write operations that result from the transaction execution. By following this approach, we are able to devise CC algorithms that are straightforward to implement and provide transaction support for any dissemination-based application without the need to analyze its inherent semantics. Clearly, by using the simple read/write model [161] to facilitate CC, we may lose a great deal of potential to enhance concurrency and ultimately performance; however, we partially compensate for this by employing two well-known performance-improving techniques: (a) multi-versioning and (b) fine-grained object-level rather than coarse-grained page-level CC.

To see how both concepts may contribute to the goal of improving concurrency in a dissemination-based system, consider the following two examples:

Example 1.

Suppose that a mono-version database contains three objects (i.e., bank accounts) x, y, and z such that x = 0, y = 0, and z = 0. Suppose further that two invariants need to be maintained by the transactions, namely x + y = 0 and x + y = z. Now suppose that transactions T1 and T2 are executed concurrently by the scheduler and the following history is produced:

H1 = r1[x,0] r2[x,0] r2[y,0] w2[x,−10] w2[y,10] w2[z,0] c2 r1[y,10] w1[z,10] a1

In this mono-version history, transaction T1 reads objects x and y before modifying the value of z and transaction T2 transfers money ($10) from account x to y and subsequently updates summary account z. In history H1 transaction T1 observes an inconsistent view of the database since it sees the partial effects of T2 and, consequently, T1 needs to be aborted by the mono-version scheduler in order to maintain database consistency.

Now let us consider a multi-version system. In this case, after transaction T2 is committed, there are two distinct snapshots in the database: {x0 = 0, y0 = 0, z0 = 0} and {x2 = −10, y2 = 10, z2 = 0}.

The first snapshot is in the past and is consistent with the current read set of T1. Thus, T1 can now be serialized on this snapshot by reading the old version of object y and the resulting history is as follows:

H2 = r1[x0,0] r2[x0,0] r2[y0,0] w2[x2,−10] w2[y2,10] w2[z2,0] c2 r1[y0,0] w1[z1,0] c1

Note that in contrast to history H1, H2 is serializable in the order T1 < T2 and the database system selects the version order x0 ≪ x2, y0 ≪ y2, z0 ≪ z1 ≪ z2.
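The difference between the two histories can be made concrete with a small sketch. The code below is an illustration of my own (the encoding of operations and the simple conflict-edge rule are simplifications, not the thesis' formal model): it derives the conflict edges of the mono-version history H1 and exposes the cycle that forces the scheduler to abort T1.

```python
# Operations of H1 as (transaction, kind, object), in history order.
ops_h1 = [
    ("T1", "r", "x"), ("T2", "r", "x"), ("T2", "r", "y"),
    ("T2", "w", "x"), ("T2", "w", "y"), ("T2", "w", "z"),
    ("T1", "r", "y"), ("T1", "w", "z"),
]

def conflict_edges(ops):
    """Ti -> Tj whenever an operation of Ti precedes a conflicting
    operation of Tj on the same object (at least one of them a write)."""
    edges = set()
    for i, (ti, ki, oi) in enumerate(ops):
        for tj, kj, oj in ops[i + 1:]:
            if ti != tj and oi == oj and "w" in (ki, kj):
                edges.add((ti, tj))
    return edges

edges = conflict_edges(ops_h1)
# r1[x] precedes w2[x]            -> T1 -> T2
# w2[y] precedes r1[y], w2[z] < w1[z] -> T2 -> T1
assert ("T1", "T2") in edges and ("T2", "T1") in edges  # cycle: abort T1
```

In H2, by contrast, the multi-version scheduler serves r1[y] from the older version y0, which removes the T2 → T1 dependency and leaves the acyclic order T1 < T2.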

Example 2.

Another way to improve the concurrency of a shared resource is to implement CC at a fine (record- or object-level) rather than a coarse (page- or table-level) granularity. (Note that as H1 is a mono-version history, we omitted the version subscripts on objects x, y, and z.) To illustrate the difference between the two approaches, we use the same example as above and again assume a multi-version system. In contrast to the example discussed above, however, the scheduler no longer maintains information on the read and write operations of transactions at the object level, but rather at the page level. Now let us assume that object x is stored on page p and objects y and z are contained in page q. Then, history H2 changes as follows:

H3 = r1[p0,0] r2[p0,0] r2[q0,0] w2[p2,−10] w2[q2,10] w2[q2,0] c2 r1[q0,10] w1[q1,10] c1

The resulting history H3 is no longer serializable since for any version order of p and q, there is a cycle in the corresponding multi-version serialization graph. The example nicely illustrates the occurrence of so-called false or permissible conflicts that arise due to information grouping. To eliminate most of those conflicts, we consider CC based on objects rather than pages.
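One can check mechanically that no version order saves H3 while H2 remains acyclic. The sketch below is an encoding of my own (assuming the standard Bernstein/Goodman MVSG edge rules and a hypothetical initial transaction T0 that wrote the initial versions), not code from the thesis:

```python
def mvsg_edges(reads, writes, vorder):
    """Bernstein/Goodman-style MVSG edges. reads, writes: sets of
    (txn, item, version); vorder: {item: [oldest .. newest]}."""
    pos = {(it, v): i for it, vs in vorder.items() for i, v in enumerate(vs)}
    writer = {(it, v): t for t, it, v in writes}
    edges = set()
    for tk, it, vj in reads:
        tj = writer[(it, vj)]
        if tj != tk:
            edges.add((tj, tk))              # reads-from: Tj -> Tk
        for ti, it2, vi in writes:
            if it2 != it or ti in (tj, tk):
                continue
            if pos[(it, vi)] < pos[(it, vj)]:
                edges.add((ti, tj))          # earlier writer before vj's writer
            else:
                edges.add((tk, ti))          # reader before later writer
    return edges

# H2 at object level (T0 writes the initial versions):
reads_h2 = {("T1","x","x0"), ("T2","x","x0"), ("T2","y","y0"), ("T1","y","y0")}
writes_h2 = {("T0","x","x0"), ("T0","y","y0"), ("T0","z","z0"),
             ("T2","x","x2"), ("T2","y","y2"), ("T1","z","z1"), ("T2","z","z2")}
obj = mvsg_edges(reads_h2, writes_h2,
                 {"x": ["x0","x2"], "y": ["y0","y2"], "z": ["z0","z1","z2"]})
assert ("T1","T2") in obj and ("T2","T1") not in obj   # acyclic: T0 -> T1 -> T2

# H3 at page level: y and z collapse onto page q.
reads_h3 = {("T1","p","p0"), ("T2","p","p0"), ("T2","q","q0"), ("T1","q","q0")}
writes_h3 = {("T0","p","p0"), ("T0","q","q0"),
             ("T2","p","p2"), ("T2","q","q2"), ("T1","q","q1")}
for q_order in (["q0","q1","q2"], ["q0","q2","q1"]):
    pg = mvsg_edges(reads_h3, writes_h3, {"p": ["p0","p2"], "q": q_order})
    assert ("T1","T2") in pg and ("T2","T1") in pg     # cycle either way
```

The false conflict arises on page q: T2 reads q0 while T1 writes a later version of q, an edge that has no counterpart at the object level because T1 and T2 touch different objects on that page.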

Allowing multiple versions of the same object to exist in the database has implications not only for the system's scheduling power and data storage requirements, but also for the currency of the data provided to the users. For example, in history H2, when transaction T1 tries to read object y, it can be provided either with the latest version of y or with an older version of y (i.e., the second latest version). While choosing the first alternative would produce a non-serializable history, using the second option results in the serializable history H2. The example clearly illustrates that multi-versioning influences not only the system's scheduling flexibility, but also the currency of the data provided to the application users. Additionally, the example provides evidence that serializability alone is insufficient to ensure that transactions see up-to-date data. Note, however, that data freshness is the most important requirement after data consistency to be satisfied by any transaction issued in a multi-version dissemination-based environment. The reason is that in real-world information-centered applications such as stock trading and monitoring systems, sport and traffic information systems, etc., the values of the data objects change frequently and are highly dynamic [41]; thus, for the objects to be meaningful to the users, they must be up-to-date or at least

“close” to it [145, 163]. Unfortunately, currently available definitions of isolation levels (ILs) either do not specify the degree of data currency they provide or, if they do, fail to ensure serializability.

To rectify this problem and to take account of the fact that the majority of the transactions initiated in broadcast-based environments is of the read-only type [124] (and any read-only transaction is easily serializable with all the other committed read-only and read-write transactions in the system by letting it read from a consistent, but not necessarily current, database snapshot), we define four new ILs that provide useful data consistency and currency guarantees to read-only transactions.

In contrast to read-write transactions, which modify the state of the database, read-only transactions may not necessarily need data consistency guarantees as strong as serializability; thus, clients may want to allow read-only transactions to be executed under slightly weaker ILs which, however, still guarantee that they observe a transaction-consistent database state. Note that in order to observe a transaction-consistent database, a read-only transaction must not see the partial effects of any update transaction; rather, for each update transaction, it must see either all or none of its effects. Two of our newly defined ILs, namely Strict Forward BOT Update Serializability and Strict Forward BOT View Consistency, provide such weaker consistency guarantees along with firm data currency guarantees (see Section 4.3.2 for information on the various types of data currency). As our ILs are defined in an implementation-independent manner by using a combination of conditions on serialization graphs and transaction histories, they allow both pessimistic (locking and non-locking) and optimistic (non-locking) CC implementations. To implement the newly defined ILs, we opted for multi-version (timestamp-based) optimistic CC schemes, since they are appealing in distributed dissemination-based client-server systems as they allow clients to execute and commit read-only transactions locally using cached information without any extra communication with the broadcast server. Pessimistic schemes, in contrast, may require such communication when, for example, an object has to be accessed or modified. To efficiently implement the various data consistency and currency guarantees of our newly defined ILs, we take advantage of the system's communication structure and periodically broadcast CC-related information to the client population at nearly no extra cost, and we also utilize the “read forward” property where applicable, i.e., we allow transactions to read “forward” beyond their starting points as long as their underlying data consistency property is not violated.

A fundamental prerequisite for any multi-version CC scheme to be effective is that the server, or even better the client, stores all those versions of data objects that are useful from the application point of view. Obviously, the closer the data is located to the CPU running the application program, the more responsive the application becomes. Therefore, the issue of client cache management is very critical to the overall system performance. While the basic idea behind client caching is very simple (i.e., keep the data objects with the highest utility of caching), in a multi-version, storage-space-constrained environment it requires the application of various client mechanisms to be effective. The first issue is related to the garbage collection of old object versions that have become useless for active transactions. To prevent useless object versions from competing with useful ones for scarce client storage space, it is highly desirable that the transaction manager provide hints to the client cache manager once an object version becomes eligible for garbage collection.

Another, but much more complex, issue is to decide which object version to replace if the cache is full and space is required for a new, not yet cache-resident object version.2 Making cache replacement decisions in multi-version broadcast-based environments significantly differs from doing so in traditional mono-version pull-based systems for the following two reasons:

• In traditional pull-based systems the cost of a client cache miss (i.e., the requested object

is non-cache-resident) is the round trip time (RTT) of sending the data request to the server

and receiving the object itself through the network interface. If we ignore the variance in

network loads and the effects of caching and disk utilization at the server, the data fetch

latency can be assumed to be the same for all requested objects. In hybrid broadcast-based

systems, however, in which the majority of non-cache-resident data is downloaded from the

broadcast channel, the data access costs are no longer constant, but rather depend on the

current position of the respective objects in the broadcast cycle. As a consequence and in

contrast to traditional cache replacement policies such as Least Recently Used (LRU), Least

Frequently Used (LFU), or Least Recently/Frequently Used (LRFU) [98, 99], replacement

victims cannot be selected based solely on the probability of their future accesses, but

2 When a client cache slot is needed for a new object version being read in from the air-cache or broadcast server, the object version selected by the cache manager to free its slot is called the replacement victim.

also on their re-acquisition costs.

• Multi-version schemes are only effective in providing users with a transaction-consistent view

of the database, if the data to be observed by the clients is actually available in the system.

However, as storage space is not infinite and managing large numbers of versions can be

very costly, it is reasonable to assume that the broadcast server imposes an upper limit on

the number of versions that it keeps around simultaneously. Under such storage constraints

and the realistic assumption that objects are frequently modified [41], situations may occur

where useful object versions need to be evicted from the server since the upper bound on

the number of versions is exceeded. That is, in multi-version systems a local cache miss

may be accompanied by a global data miss, resulting in an abort of the affected transaction. Therefore, in order to avoid or reduce the number of such transaction aborts induced

by data fetch misses, the client cache manager should use an exclusive portion of its available storage space to store object versions which are locally important and potentially useful

for some ongoing transaction, but which have been evicted from the server. Additionally,

when choosing a replacement victim from the set of non-re-cacheable object versions, the

victim should be the object version which has the lowest local access probability and the

lowest re-acquisition costs. Since the object versions under consideration are no longer

re-cacheable, the re-acquisition costs correspond to the estimated amount of time it takes to

re-execute the transaction(s) for which the respective object version might be useful.
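A minimal sketch of the kind of cost-sensitive victim selection argued for above (the scoring formula, names, and numbers are illustrative assumptions of mine, not the MICP definitions): the expected cost of a future miss weighs an object version's access probability against the ticks until it reappears on the broadcast channel, so even a frequently accessed version can be the cheapest eviction if it returns soon.

```python
def reacquisition_cost(pos, now, cycle_len):
    """Ticks until the object next appears on the periodic broadcast."""
    return (pos - now) % cycle_len

def pick_victim(cache, now, cycle_len):
    """cache: {version: (access_prob, broadcast_pos)}. Evict the version
    whose expected cost of a future miss is smallest."""
    def expected_miss_cost(v):
        p, pos = cache[v]
        return p * reacquisition_cost(pos, now, cycle_len)
    return min(cache, key=expected_miss_cost)

# Hypothetical cache contents at tick 0 of a 100-tick broadcast cycle:
cache = {"x2": (0.40, 10), "y1": (0.10, 90), "z3": (0.30, 20)}
# expected miss costs: x2 = 0.4*10 = 4, y1 = 0.1*90 = 9, z3 = 0.3*20 = 6
assert pick_victim(cache, now=0, cycle_len=100) == "x2"
```

Note that x2 is evicted despite having the highest access probability, because it is rebroadcast soonest; for non-re-cacheable versions, the discussion above suggests substituting the estimated transaction re-execution time for the broadcast re-acquisition cost.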

A third issue concerns prefetching object versions in anticipation of their future accesses.

The intuition behind prefetching in a multi-version context is to pre-cache those object versions that are likely to be requested in the near future in order to avoid any delay when the access eventually occurs. Prefetching is particularly appealing in dissemination-based environments because, unlike in pull-based systems, it does not place any additional load on shared resources (server disk(s), server processor(s), network, etc.), since the data flows past the client anyway; it therefore impacts only the client resources. However, despite the much lower costs of prefetching in dissemination-based environments, the risk of excessive and imprudent prefetching still remains, i.e., scarce cache space can be wasted by pre-caching object versions that are not important or are prefetched too early. Additionally, the danger exists that an object version is replaced by another equally important object version and shortly after its eviction becomes non-re-cacheable from the server. To prevent such situations from occurring, prefetching decisions should be based not only on the cost of re-acquisition and the probability of access, but also on the likelihood that the object version will be re-cacheable in the next broadcast cycle. As traditional and recently designed client caching and prefetching algorithms [4, 5, 32, 83, 119, 151] do not adequately address the issues discussed above, we have developed a new multi-version integrated caching and prefetching policy, called MICP, which is presented along with an implementable approximation, termed MICP-L, in this thesis.

As we have shown by means of histories H1 and H2, multiple versions of data objects can be exploited to improve the degree of concurrency not only between read-only and read-write transactions, but also between read-write transactions themselves. The higher concurrency can be realized by providing the scheduler with the flexibility to map so-called “out-of-order” read requests to appropriate older versions of data and to select a version order for each object that need not necessarily correspond to the write or commit order of the transactions that created the object versions in the history, i.e., the scheduler may choose the version order x2 ≪ x1 even though T1 committed before T2. As a result of this flexibility, however, the scheduler may produce undesirable anomalies caused by trading scheduling power for data freshness. For example, in history H2, T1 observes the updates of transaction T0 (which is not shown in H2), but misses the effects of transaction T2.

Now suppose that the clients that execute transactions T1 and T2 communicate with each other or, alternatively, that both transactions are executed at the same client: the client running transaction T1 may be confused about the (old) values read from objects x and y, since it knows that both objects were recently modified by transaction T2. To prevent users of dissemination-based systems from experiencing such possibly undesirable read phenomena, CC protocols need to incorporate a precise notion of data currency in their specifications, allowing users to initiate their read-write transactions with well-defined data currency guarantees. As existing broadcast-based multi-version concurrency control (MVCC) protocols providing serializability to read-write transactions do not explicitly specify the degree of currency they provide, we rectify this shortcoming by defining a suite of

five new MVCC schemes, denoted MVCC-*, that ensure different levels of data freshness to their users.

1.2 Contribution

In this thesis we broadly address the issue of providing good performance along with appropriate data consistency and currency guarantees to read-only and read-write transactions in a broadcast-based, bi-directional client-server data dissemination environment. The issues of transforming the database from one consistent state to another and of providing a consistent view of the database in spite of concurrency have been investigated earlier in the context of centralized, distributed, and mobile databases. However, while the basic problem being addressed is not new, the integration of precise notions of data currency into the definitions of ILs ensuring reliable data currency guarantees to read-only transactions is novel. We further contribute by defining various MVCC protocols that provide appropriate data consistency and currency guarantees to read-only and read-write transactions alike and provably outperform previously proposed CC schemes by significant margins.

Additionally, a novel cache replacement and prefetching policy is proposed that ideally supports the data requirements of our MVCC protocols.

In particular, this thesis makes the following contributions to the main research issues, namely read-only transaction management, client caching and prefetching, and read-write transaction management, in the area of broadcast-based, wired and wireless data dissemination networks:

Read-only Transaction Management:

• Taking account of the fact that read-only transactions constitute the vast majority of the transactions executed in dissemination-based systems [124] and that current definitions of ILs such as Conflict Serializability [120], Update Serializability [61, 162], or External Consistency/Update Consistency [29, 159] are not perfectly suitable for processing read-only transactions, as they lack any semantic guarantees as far as data currency is concerned, we specify four new ILs dedicated to read-only transactions that provide useful data consistency and currency guarantees to application programmers and users alike. Among the newly defined levels, Strict Forward BOT View Consistency is the weakest one, allowing the highest degree of transaction concurrency. Strict Forward BOT View Consistency ensures causally-consistent reads and permits a read-only transaction Ti to observe the updates of transactions that committed after Ti's starting point as long as Ti sees a transaction-consistent database state.

• While the newly defined ILs are applicable to any type of transaction-based database system integrated into any network environment, we present tailor-made CC protocols that support the different ILs in push/pull hybrid data delivery environments. Our protocols, namely MVCC-BS, MVCC-SFBS, MVCC-SFBUS, and MVCC-SFBVC, are optimistic in nature and allow clients to validate and subsequently either abort or commit read-only transactions without communicating with the server, which obviously offloads work from it, thereby making the system more scalable.

• We evaluate the performance of the implementations of our newly defined ILs via simulations. To the best of our knowledge, this is the first simulation study investigating the performance trade-offs of providing various levels of data consistency and currency to read-only transactions in a mobile hybrid data delivery environment. The results show that implementations that provide Full Serializability, such as MVCC-SFBS, do not incur high performance penalties (about 1-10%, depending on the average size of the read-only transactions) compared to schemes that provide weaker guarantees such as View Consistency. We also conducted a comparison study to examine our protocols' performance against that of existing CC schemes which were specifically designed for dissemination-based systems. The results show that our best performing CC protocol, MVCC-SFBVC, outperforms the other examined protocols by significant margins.

Client Caching and Prefetching:

• Caching and prefetching data at clients is one of the most effective techniques, if not the most effective one, to improve the overall system performance of dissemination-based systems. As previously proposed client cache replacement and prefetching strategies do not optimally support the data storage and preservation requirements of clients using MVCC protocols to provide a current and transaction-consistent view of the database regardless of concurrent read-write transactions in the system, we introduce a new integrated caching and prefetching algorithm, called MICP. MICP takes account of the dynamically changing cost/benefit values of client-cached and air-cached object versions by making cache replacement and prefetching decisions sensitive to the objects' access probabilities, their re-acquisition costs, their update likelihood, etc. To solve the problem that newly created or non-current, but re-cacheable, object versions replace non-re-cacheable versions in the client cache, MICP divides the available cache space into two variable-sized partitions, namely REC and NON-REC, for caching re-cacheable and non-re-cacheable object versions, respectively.

• We compare MICP's performance in terms of transaction throughput with one offline and two state-of-the-art online cache replacement and prefetching policies (P, and LRFU and W2R, respectively). The results show that MICP outperforms the currently best online caching algorithm, LRFU, by about 19% and performs only 40% worse than the offline strategy P, which keeps the objects with the highest probability of access in the cache.

Read-Write Transaction Management:

• Another contribution of this thesis is to formally define three new MVCC protocols that

provide serializability along with well-defined data currency guarantees to read-write transactions,

namely MVCC-BOT, MVCC-IBOT, and MVCC-EOT, to present optimizations of the first two

schemes, to prove their correctness, and to evaluate their performance against each other and

w.r.t. the well-known and frequently implemented Snapshot Isolation scheme. The motiva-

tion for proposing yet another suite of CC protocols is the observation that currently avail-

able protocols designed to manage read-write transactions either do not provide serializability

guarantees [121] or enforce serializability in a suboptimal way [102,110]. Furthermore, none

of those protocols that ensure serializability for read-write transactions explicitly specifies the

degree of data currency it provides.

• We discuss techniques that can be applied to identify and prevent false or permissible conflicts among those detected by our MVCC protocols, and provide and evaluate ways to resolve

data conflicts without restarting the constituent read-write transaction. We also provide so-

lutions to make our protocols resilient to network failures and present experimental results

quantifying the influence of client disconnections from the network on the performance of

our protocols.

• We explain why our cache replacement and prefetching strategy MICP, initially designed for

exclusive use in applications issuing read-only transactions, can be utilized to service read-

write transactions, too. Performance results demonstrate that the performance penalty of

deploying LRFU instead of MICP when running read-write transactions is about 6% on the

average.

1.3 Publications

The main contributions on which the thesis is based have already been the subject of technical reports and of conference and journal papers:

• A. Seifert and M. H. Scholl. Processing Read-only Transactions in Hybrid Data Delivery

Environments with Consistency and Currency Guarantees. Tech. Rep. 163, University of

Konstanz, Dec. 2001.

• A. Seifert and M. H. Scholl. A Transaction-Conscious Multi-version Cache Replacement and

Prefetching Policy for Hybrid Data Delivery Environments. Tech. Rep. 165, University of

Konstanz, Feb. 2002.

• A. Seifert and M. H. Scholl. A Multi-Version Cache Replacement and Prefetching Policy for

Hybrid Data Delivery Environments. VLDB 2002, pp. 850-861, Aug. 2002.

• A. Seifert and M. H. Scholl. Processing Read-only Transactions in Hybrid Data Delivery

Environments with Consistency and Currency Guarantees. MONET 8(4): 327-342, Aug.

2003.

Publications that also arose from the author’s research work but are not covered by the thesis are as follows:

• A. Seifert and M. H. Scholl. COBALT — A Self-Tuning Load Balancer for OLTP Systems,

submitted for publication, Oct. 2004.

• J.-J. Hung and A. Seifert. An Efficient Data Broadcasting Scheduler for Energy-Constraint

Broadcast Servers, accepted for publication in the International Journal of Computer Systems

Science and Engineering (IJCSSE), Jan. 2005.

1.4 Outline

The outline of the rest of the thesis is as follows:

• In the next chapter, we first provide background information on the basic concepts of wireless

data communications, highlight the specific characteristics of popular wireless network types,

and discuss the various types of asymmetry prevalent in mobile data networks. We enumerate

the limitations of mobile computing systems and discuss their impact on the provision of data

consistency and currency in the presence of concurrency in mobile computing environments.

• In Chapter 3 we give several reasons why traditional data pull delivery and its inverse ap-

proach, i.e., data push delivery, are not appropriate for building large-scale dissemination-

based systems and present the concept of hybrid data delivery as a suitable alternative delivery

mode for such systems. We then present possible configuration options of the communication

media to facilitate hybrid data delivery and identify the major underlying assumptions of the

thesis. The chapter concludes with a survey of the literature discussing technological aspects

relevant for the implementation of hybrid data delivery systems which are not covered in the

remainder of the thesis.

• Chapters 4, 5, and 6 constitute the main body of this thesis. Chapter 4 points out that cur-

rently available definitions of ILs are not appropriate for managing read-only transactions,

since they lack any data currency guarantees and hence may lead to wrong decisions. To re-

solve this problem, we then propose four new ILs which provide useful data consistency and

currency guarantees to application programmers and users alike. Next, we present suitable

implementations of the proposed ILs which take account of the various limitations and pecu-

liarities of a mobile hybrid data delivery system. The chapter concludes with the presentation

of the experimental results of a detailed performance study, showing the performance devi-

ations among our protocols, and demonstrating the superiority of MVCC-SFBVC (which is

our best performing CC protocol) over other previously proposed schemes.

• In Chapter 5 we introduce a novel multi-version integrated cache replacement and prefetching

algorithm, called MICP. The chapter first motivates the need for yet another cache replace-

ment and prefetching policy and then presents the system design and basic assumptions of

MICP. We describe in detail how MICP determines cache replacement victims, how it orga-

nizes the client cache, and how it reduces the computational cost for making its replacement

and prefetching decisions. Again, the chapter concludes with a performance study validating

the applicability and practicability of our algorithms in fulfilling the data storage and preser-

vation requirements of read-only transactions. Additionally, the performance penalty of using

other caching and prefetching strategies than MICP is quantified.

• Chapter 6 tackles the challenging problem of providing serializability guarantees along with

good performance to mobile applications that access, modify, and insert information into

a widely shared, universally accessible database. It presents MVCC-*, a suite of five new

MVCC protocols, whose design pays particular attention to the peculiarities of mo-

bile hybrid data delivery systems. It also discusses possible extensions of the suite’s protocols

targeted towards identifying, and thus reducing, the number of false or permissible data conflicts among those detected by them. As in previous chapters, the presentation of results of

numerical experiments conducted to show the performance trade-offs among the protocols of

the MVCC-* suite themselves and between the suite’s protocols and the Snapshot Isolation

scheme concludes the chapter.

• Finally, Chapter 7 summarizes the contributions and results of our work, and indicates some

possible future research directions. “Once a new technology rolls over you, if

you’re not part of the steamroller, you’re

part of the road.”

– Stewart Brand

Chapter 2

Background

The ultimate goal of mobile computing is to allow mobile users to access external data resources and to provide consistent and timely access to their own information collections to anyone, at any time, from anywhere. The key enabling component to facilitate mobile computing is, besides the portability of the mobile devices, ubiquitous connectivity, i.e., connectivity at any place, any time.

Wireless data technology provides mobile users with the capability to have access to all types of data, including plain text, photos, graphics, audio, and video. Due to the importance of wireless technology for mobile computing, in the following section we attempt to briefly describe the basics of wireless data communications, highlight the specific characteristics of various wireless network types, and finally discuss data management issues arising due to the various types of asymmetry that occur in mobile data networks. Then, we enumerate the various limitations of mobile computing systems such as frequent disconnection, limited battery life, low-bandwidth communication and reduced storage capacity and discuss their impact on our objective of providing appropriate data consistency and currency guarantees to both read-only and read-write transactions along with good performance in spite of the presence of failures (e.g., disconnections) and concurrency in the systems.


2.1 Basics of Wireless Communication Systems

A wireless communication system typically consists of numerous wireless communication links, each of which comprises, in its most primitive form, a transmitter, a receiver, and a channel.

Wireless communication links can be classified as simplex, half-duplex, or full-duplex. In simplex systems, communication is constrained to only one direction. A data dissemination server that broadcasts data to mobile users without the provision of a back-channel or an uplink channel to the server is an example of a simplex system. Half-duplex systems provide means of two-way communication; however, they use the same physical channel for data transmission and reception and, therefore, the mobile user can either receive or transmit information at any given instant of time. Examples of half-duplex systems are IrDA infrared links or most 802.11 WLAN links. Full-duplex systems such as cellular radio network systems allow mobile users to simultaneously send and receive data by provision of either two separate radio channels (frequency division duplexing or FDD) or adjacent time slots on a single radio channel (time division duplexing or TDD).

The two main propagation technologies used in wireless communication systems are infrared

(IR) and radio frequency (RF). Among those, RF is regarded as more flexible and practical as it propagates through solid obstacles such as walls [36]. To provide radio communication service simultaneously to as many mobile users as possible over a wide area, wireless multiple-access techniques and frequency reuse must be adopted. There are two main types of multiple access techniques, namely demand-assign based and random multiple access [8], and in practice their deployment depends on the data traffic requirements. If continuous data flow along with high responsiveness is required, demand-assign based multiple access is applied. In this technique, the available radio channels are divided in a static fashion and each user is exclusively assigned one or more of those channels by a base station, irrespective of the fact that it might not require the entire bandwidth capacity of the channel. Prominent examples of the demand-assign based multiple access method are frequency division multiple access (FDMA), time division multiple access

(TDMA) [133], and code division multiple access (CDMA) [129]. In FDMA the available radio frequency spectrum S is divided among U simultaneous users such that each user is allocated a channel bandwidth of C = S/U Hz. In TDMA the radio frequency spectrum is divided into time slots that are allocated among the users. In this system, a radio channel can be thought of as a particular time slot that cyclically re-occurs every transmission period. In CDMA all users share the same carrier frequency and are allowed to transmit data simultaneously. CDMA is implemented with direct sequence spread spectrum or frequency hopping, and each user is allocated a different pseudo-random spreading code or hopping pattern, respectively, which separates users from each other.

[Figure 2.1: Demand-assign based multiple access techniques. Three panels, each plotting channels 1..N along the code, frequency, and time axes: (a) FDMA separates channels in frequency, (b) TDMA in time, (c) CDMA in code.]
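The FDMA and TDMA capacity arithmetic above can be sketched in a few lines. The 25 MHz spectrum, 125-channel, and 8-slot figures below are GSM-like numbers chosen purely for illustration, not values from the text:

```python
# Idealized per-user capacity under FDMA and TDMA, ignoring guard bands,
# guard times, and signaling overhead. The 25 MHz / 125-channel / 8-slot
# figures are GSM-like numbers used purely for illustration.

def fdma_channel_bw(spectrum_hz: float, users: int) -> float:
    # FDMA: spectrum S is split into U frequency channels, C = S/U Hz each.
    return spectrum_hz / users

def tdma_user_rate(channel_rate_bps: float, slots_per_frame: int) -> float:
    # TDMA: each user owns one cyclically recurring slot per frame.
    return channel_rate_bps / slots_per_frame

print(fdma_channel_bw(25_000_000, 125))  # 200000.0 Hz per channel
print(tdma_user_rate(271_000, 8))        # 33875.0 bit/s per slot
```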

If network bandwidth requirements are highly dynamic with a high peak-to-average rate ratio, random multiple access systems are used, in which many users share the same radio channel and transmission occurs in a random, i.e., uncoordinated, or partially coordinated way. Popular examples are the ALOHA protocol [2] and the carrier sense multiple access (CSMA) protocol [92].

ALOHA is a contention-based scheme which allows users to access a channel whenever a message is ready to be transmitted. The sender then listens to the channel to receive the acknowledgment feedback to determine whether the transmission was successful or not. In case a packet collision occurs, the sender waits a randomly determined period of time, and then retransmits the packet. In the CSMA protocol, users monitor the status of the channel and only transmit messages if the channel is idle. Obviously, packet collisions may still occur since just after a user sends a packet, another user may be sensing the channel and may detect it to be idle. Consequently, it will send its own packet, resulting in a collision with the first one.
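The contention behavior described above can be illustrated with a minimal slotted-ALOHA model. This is an idealized simulation, not a protocol implementation; the station count, transmission probability, and slot count are arbitrary illustration values:

```python
import random

# Minimal slotted-ALOHA model: in every slot each station independently
# transmits with probability p; a slot carries a packet successfully iff
# exactly one station transmits, otherwise the senders back off and retry.
# Station count, p, and slot count are arbitrary illustration values.

def slotted_aloha_throughput(stations: int, p: float, slots: int,
                             seed: int = 42) -> float:
    rng = random.Random(seed)
    successes = 0
    for _ in range(slots):
        transmitters = sum(1 for _ in range(stations) if rng.random() < p)
        if transmitters == 1:  # collision-free slot
            successes += 1
    return successes / slots

# With p = 1/N the expected throughput approaches 1/e (about 0.37),
# the classic slotted-ALOHA maximum.
print(slotted_aloha_throughput(stations=20, p=0.05, slots=10_000))
```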

In addition to using demand-assign based and random access methods separately, variants of both access techniques can be combined to take advantage of both approaches. A hybrid scheme has the advantage that if both underlying access methods are fine-tuned to each other, the available radio channels may neither become overloaded nor underloaded in peak-load situations. This is because the demand-assign based access method assigns radio channels on a request basis and can therefore act as an admission control. If multiple users requiring network services are likely to produce random data traffic, they are grouped together by a demand-assign based access technique to operate through the same radio channel. Users of the same group then access the radio channel based on a random multiple access technique.

In order to provide wireless communication services to many users over a wide coverage area, frequency reuse strategies that typically exploit spatial separation, either in cellular distance, antenna angle, or signal polarization, can be employed. All those techniques have in common that they enable multiple radio channels in geographically separate areas or cells to use the same radio spectrum with relatively low co-channel interference. Applying one or more of those strategies increases the capacity of the radio network and ultimately contributes to the efficient use of the radio spectrum, which is a scarce natural resource of finite limits.

As the knowledgeable reader has certainly recognized, there are many more technological aspects of wireless communication systems, such as radio wave propagation and interference, channel coding, etc., that were not covered so far. The discussion of those topics, however, is beyond the scope of this thesis and, therefore, we refer the interested reader to [54, 128, 132].

2.2 Wireless Network Types

To get an overview of the characteristics, capabilities, and limitations of existing and proposed mobile communication networks, and to be able to evaluate them w.r.t. their applicability to support data-intensive transaction-based mobile applications, we now give a more detailed description of them. As with wired networks, wireless communication networks can be classified into different types based on the geographic scope, i.e., the distances over which data can be transmitted:

Wireless Wide Area Networks (WWANs)

WWANs connect large geographic areas, such as cities or countries, via multiple antenna sites

(cellular base stations) or satellite systems. Currently deployed WWAN technologies correspond to second-generation (2G) and third-generation (3G) wireless cellular systems and communication satellite systems whose main features will be briefly described below. The majority of the 2G cellular networks deployed on Earth is based on the Global System for Mobile Communications (GSM) standard which has been deployed by carriers in Europe, Asia, Australia, South America and some parts of the US.

Other standards include the Interim Standard 136 (IS-136), Pacific Digital Cellular (PDC), and Interim Standard 95 (IS-95) which are used by service providers in North America, South America, and Australia (IS-136), Japan (PDC), and North America, Korea, Japan, China, South America, and Australia (IS-95), respectively. In 2G systems, the raw data transfer rate is only 9.6 Kbps, which is certainly too slow for data-intensive applications. Currently deployed, so-called 2.5G networks, which include High Speed Circuit Switched Data (HSCSD), General Packet Radio Service

(GPRS), Enhanced Data Rates for GSM Evolution (EDGE), and IS-95B, provide much higher raw transmission rates of up to 57.6 Kbps, 171.2 Kbps, 384 Kbps, and 115.2 Kbps, respectively [132].

In loaded networks, however, achievable data rates are much lower. For example, the GPRS rate of 171.2 Kbps (8 × 21.4 Kbps) per channel includes all the Radio Link Control/Media Access

Control (RLC/MAC) overhead. After subtracting the protocol overhead required for the sharing of the radio channel, the actual data rate the user sees is 130.24 Kbps. This rate, however, will only be achieved in situations where all 8 slots in a TDMA frame are dedicated to the user and the transmission itself is error-free. Both assumptions are highly unlikely in a congested wireless network and consequently, experienced GPRS throughput rates are only in the range of 10-40 Kbps downstream and 10-20 Kbps upstream [105]. EDGE networks provide significantly higher data rates ranging from 30-200 Kbps for the downlink channel and 30-60 Kbps for the uplink channel [156]. However, despite the significant bandwidth improvement of EDGE networks compared to 2G networks, the transfer of a relatively small replica file of 1 MB will nevertheless take about 40 seconds at a maximum data rate of 200 Kbps.
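The rate and transfer-time arithmetic in this paragraph is easy to verify (assuming 1 MB = 10^6 bytes and 1 Kbps = 1000 bit/s):

```python
# Back-of-the-envelope check of the GPRS/EDGE figures in the text.

# Raw GPRS channel rate: 8 TDMA slots at 21.4 Kbps each (incl. RLC/MAC).
gprs_raw_kbps = 8 * 21.4  # 171.2 Kbps

def transfer_time_s(file_bytes: int, rate_kbps: float) -> float:
    """Seconds needed to ship a file at a sustained data rate."""
    return file_bytes * 8 / (rate_kbps * 1000)

# A 1 MB replica at EDGE's 200 Kbps peak downlink rate: about 40 seconds.
print(round(transfer_time_s(1_000_000, 200)))  # 40
```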

As with the transition from 2G to 2.5G, the primary incentives for the transition from 2.5G to 3G are increased data rates and greater network capacity for operators. Like 2.5G systems, 3G networks are characterized by built-in asymmetry in the uplink and downlink data rates. The uplink data rate is limited by battery power consumption and complexity limitations of the mobile terminals, and a user is able to achieve about 64 Kbps. The bandwidth available in the downlink direction is 3-6 times higher than in the uplink direction, and pedestrian users are supported by downlink data rates of up to 384 Kbps. Despite the increased bandwidth capacities and worldwide roaming capabilities of 3G networks, the rollout of 3G has been delayed primarily for lack of good inexpensive handsets and other technical issues. As a result, by the end of 2003 [135] only eight commercially operating

3G systems were deployed. Note, however, that good handsets are now starting to appear on the market in greater numbers, and most of the technical issues have been resolved, which will certainly help 3G networks to grow.

Like terrestrial cellular networks, satellite systems are rapidly evolving to meet the demand for mobile services. Due to their large communication coverage areas, satellite systems can be considered as a complementary technology to cellular networks in un-populated and low-traffic areas where cellular networks are not competitive. However, in addition to providing “out-of-area” coverage to mobile users, recent developments in satellite technology such as narrow beam antennas and switchable spot beams enable satellites to be used to off-load congestion within highly populated cellular network areas and to provide mobile application users with uplink and downlink bandwidth that is up to an order of magnitude larger than in cellular networks. In what follows, we briefly describe the various types of satellite systems and discuss their main characteristics.

Satellite systems can be classified according to their orbital altitudes into: (a) low-altitude

Earth orbit (LEO) satellites residing at about 500 − 2000 km above the Earth, (b) medium-altitude

Earth orbit (MEO) satellites circulating at about 5000 − 20000 km above the Earth, and (c) geostationary orbit (GEO) satellites located at 35,786 km above the Earth.

Today, the majority of satellite links is provided by GEO satellites. GEO satellites are tied to the rotation of the Earth and are therefore in a fixed position in space in relation to the Earth’s surface. Thus, each ground station is always able to stay in contact with the orbiting satellite at the same position in the sky. Due to the high orbit of GEO satellites, their “footprints”, i.e., the ground areas that are covered by their transponders (transmitters), are large in size. GEO satellites

“see” almost a third of the Earth, i.e., it takes only three GEO satellites to cover (almost) the whole

Earth. However, there are various obstacles as well: (a) Due to the long signal paths, the theoretical propagation delay of the signal to travel the distance from the ground station to the satellite and back again is 239.6 ms [106]. Therefore, the propagation delay of a data message and its corresponding reply (one round-trip time (RTT)) takes at least 479.2 ms. In practice, however, signal RTTs are slightly higher ranging from 540 − 600 ms [65] depending on how fast the satellite and ground station can process and re-send the signal. (b) As GEO satellites cover a large area with a diameter in the range of about 10,000 − 15,000 km, available radio frequencies are inefficiently used. Note that this problem can be alleviated by using spot beams. However, despite this technology GEO systems will never be as efficient as equivalent LEO systems. (c) There is another problem due to the long signal path between the Earth and the satellite. As the strength of a radio signal falls in proportion to the square of the distance traveled, either a very high signal transmit power, large receiver antennas, or an appropriate combination of both is required.
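The quoted delay figures follow directly from the orbital altitude and the speed of light. The sketch below assumes a straight up-and-down signal path, which is why it comes out slightly below the text's 239.6 ms slant-path figure:

```python
# Propagation delay to a geostationary satellite over a straight
# up-and-down path (a slanted path, as assumed in the text, is slightly
# longer, hence the text's 239.6 ms).

C_KM_S = 299_792.458       # speed of light in vacuum, km/s
GEO_ALTITUDE_KM = 35_786   # GEO altitude above the Earth's surface

one_way_s = 2 * GEO_ALTITUDE_KM / C_KM_S  # ground -> satellite -> ground
rtt_s = 2 * one_way_s                     # a message plus its reply
print(f"one-way: {one_way_s * 1000:.1f} ms")  # ~238.7 ms
print(f"RTT:     {rtt_s * 1000:.1f} ms")      # ~477.5 ms
```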

MEO satellites are mainly used in global positioning systems (GPS) and are not stationary in relation to the rotation of the Earth. MEO satellites as well as LEO satellites require the use of constellations of satellites for constant coverage. That is, as one satellite leaves the ground station’s sight, another satellite appears on the horizon and the channel is switched to it. Due to orbiting at an altitude of only about 1/3 of GEO satellites, they obviously incur less round-trip delay

(220 − 300 ms [87]) than GEO satellites, but also have smaller footprints.

LEO satellites have become very popular in the last few years as demand for broadband communication has surged. Compared to GEO and MEO satellites, LEOs provide many advantages: (a)

Only low transmission power is required to reach a LEO satellite, which opens the door to pocket-sized transceivers. (b) Due to orbiting closer to Earth, short RTTs of about 40 − 50 ms [87] are achievable. Note that there could be additional delays for global LEO networks over the terrestrial network that could bring the RTT up to 100 ms. (c) As the satellite orbit is closer to Earth, LEOs are also cheaper to launch. However, deploying LEO satellites is not free of problems: (a) As LEOs are closer to Earth, their footprint is relatively small. Consequently, about 40 − 80 satellites are required to attain global coverage. (b) As LEOs orbit close to Earth, they are forced to travel at high speeds so that gravity does not pull them back into the atmosphere. LEO satellites achieve speeds in the range from 4,000 − 27,000 km/h and thus, circle the Earth in about 1.5 − 10 h. Hence, the satellite is only in sight for about 5 − 30 minutes and thus inter-satellite hand-overs are frequent. (c) Satellites experience orbital decay and have physical lifetimes determined almost entirely by their interaction with the atmosphere. As the atmospheric density increases progressively towards the ground, the orbital decay is much higher at lower altitudes. Thus, LEOs, with 5 − 10 years, have a much shorter lifetime than MEOs or GEOs. Table 2.1 on page 28 summarizes various key performance metrics of cellular and satellite systems. As the satellite footprint and its downlink and uplink data rates are system-specific, the figures in Table 2.1 refer to concrete systems, namely iPSTAR (GEO) and Teledesic (LEO) [37].

Wireless Metropolitan Area Networks (WMANs)

As the name implies, WMAN technologies enable users to establish fixed wireless connections within a metropolitan area. The attribute “fixed” refers to the fact that the nodes exchanging radio frequency signals with each other are stationary, unlike mobile wireless technology. WMANs offer an alternative, wireless means to cabled access networks, such as fiber optic links or coaxial systems, for delivering broadband services to rural businesses and residences. Compared to cabled access networks, WMAN technology is less expensive and faster to deploy and has the potential to lead to more ubiquitous broadband access as it provides network service options to areas with no or insufficient wired infrastructure. WMAN technology can be classified into two groups: (a) Multi-channel Multi-point Distribution Service (MMDS) and (b) Local Multi-point Distribution Service

(LMDS).

MMDS systems operate in the 2.1 GHz to 2.7 GHz band and provide a line-of-sight (LOS) service which means that data transmission does not work well around mountains, buildings, or any other type of signal barrier. MMDS systems have a service range of up to 50 km and maximum uplink and downlink bandwidth speeds of 256 Kbps and 10 Mbps, respectively [134]. LMDS is a broadband wireless technology that occupies the largest chunk of spectrum (1,300 MHz in the

US) ever devoted to any wireless service. LMDS utilizes RF technology in the 27.5 − 31.225 GHz band in the US and 40.5 − 43.5 GHz band in Europe for data transmission. Due to the high transmission frequencies, LMDS is capable of providing downlink and uplink bandwidth in the range of

20 − 50 Mbps and 10 Mbps, respectively [47]. However, compared to MMDS, LMDS has a much smaller coverage ranging from 3 − 5 km. Like MMDS, LMDS requires LOS and is susceptible to environmental influences such as rain. Finally, note that since 1999, the IEEE 802.16 working group [72] has been developing specifications to standardize the development of WMAN technologies.

Wireless Local Area Networks (WLANs)

WLANs are smaller-scale wireless networks with a distance coverage range of up to

100 meters [16]. WLANs have been around since the late 1980s and can be seen either as a replacement for or as an extension of wired Ethernet. The most prevalent form of WLAN technology is called Wireless Fidelity (WiFi), which includes a host of standards including 802.11a, 802.11b, and 802.11g. 802.11b, approved by the IEEE in 1999, is an upgrade of the 802.11 standard, approved in 1997, which raised the transmission speed from 2 to 11 Mbps, which is approximately the same bandwidth capacity as wired Ethernet connections. Technically, 802.11b is a half-duplex protocol which operates in the highly populated 2.4−2.483 GHz industrial, scientific, and medical

(ISM) band and its transmission distances vary from 20 − 100 meters depending on the equipment used and the configuration selected. 802.11a was approved in Sep. 1999 and operates in the

5.15 − 5.25, 5.25 − 5.35, and 5.725 − 5.825 GHz bands and thus avoids the interference problems experienced by the 802.11b technology due to other products operating in the same frequency spectrum.

802.11a employs 300 MHz of bandwidth in the 5 GHz band, which accommodates 12 independent, non-overlapping 20 MHz channels compared to 3 non-overlapping channels in 802.11b. Each channel supports up to

54 Mbps of throughput, shared among the mobile users operating in the same channel. As higher frequency signals have more difficulties propagating through physical obstructions than those at

2.4 GHz, the operational range of 802.11a, at 5 − 30 meters, is somewhat less than that of 802.11b [16].

802.11g, approved in June 2003, attempts to combine the best of both 802.11a and 802.11b. 802.11g supports bandwidth up to 54 Mbps by using 802.11a’s orthogonal frequency division multiplexing

(OFDM) modulation technique [46]; it uses the 2.4 GHz frequency for greater coverage range and is compatible with equipment based on the earlier 802.11b wireless standard.

Wireless Personal Area Networks (WPANs)

WPANs are short to very short range (up to 10 meters) wireless ad-hoc networks that can be used to exchange information between devices such as PDAs, cellular phones, laptops, or sensors located within communication distance. Currently, the two dominant WPAN technologies are Bluetooth and IrDA. Bluetooth [28], named after the Viking king Harald I Bluetooth, who unified Denmark and Norway in the 10th century, operates in the ISM frequency band at 2.402-2.483 GHz in the

US and in most countries in Europe. This band is divided into 79 (1 MHz wide full-duplex) channels, where each channel provides a data rate of 723.2 Kbps (raw data rate 1 Mbps). Bluetooth, if applied for ad-hoc networking purposes, can support up to 8 devices – one of them selected as a master – which together form a piconet similar to, but much smaller than, an IEEE 802.11 cell. To allow network nodes to form networks larger than 8 nodes, a node can act as a bridge between two overlapping piconets and create a larger network called a scatternet. The scatternet architecture allows devices to communicate with each other that are not directly connected to each other due to long distances or too many devices sharing the same spatial location. The Bluetooth specification is driven by the Bluetooth Special Interest Group (SIG) and in its currently valid version 1.2 [27], it defines two types of radio links to support voice and application data, namely synchronous connection-oriented (SCO) and asynchronous connection-oriented (ACL1). An SCO link is a symmetric, point-to-point link between the piconet master and a specific slave and can be considered as a circuit-switched network. SCO provides up to three synchronous 64 Kbps voice channels which can be used simultaneously. An ACL link is packet-switched and thus, is intended for packet transfer, both asynchronous and isochronous (i.e., time-sensitive). An ACL link can support an asymmetric link of maximally 723.2 Kbps in the downlink direction and 57.6 Kbps in the uplink direction, or a 433.9 Kbps symmetric link [27].

1It is obvious that the most appropriate abbreviation for asynchronous connection-oriented is ACO. However, this acronym has an alternative meaning in the Bluetooth 1.1 specification and is therefore not used in version 1.2.

The IrDA communication standard for transmitting data via infrared light waves was developed by the Infrared Data Association (IrDA) [81], which is an industry-based group of over 150 companies that was formed in 1993. IrDA is a LOS, point-to-point, ad-hoc data transmission standard. IrDA operates in half-duplex mode since full-duplex communication is not possible, as a device is blinded by the light of its own transmitter while transmitting data. In order to minimize interference with surrounding devices, IrDA devices transmit infrared pulses in a cone that extends 15 degrees half angle off center. IrDA is designed to operate over a distance of up to 1 meter and at data speeds that fall into four different categories, namely (a) Serial Infrared (SIR) supporting speeds up to

115.2 Kbps, (b) Medium Infrared (MIR) supporting 0.576 Mbps and 1.152 Mbps data rates, (c)

Fast Infrared (FIR) supporting a 4.0 Mbps data rate, and (d) Very Fast Infrared (VFIR) supporting

16.0 Mbps [80]. Although Bluetooth has been invented as an enhancement of IrDA and it actually copes with IrDA’s limitations by providing omnidirectional rather than unidirectional connections, by extending the connectivity range between devices from 1 to 10 meters, or by supporting point- to-multipoint connections, the two technologies are quite complementary. While Bluetooth is very well-suited for building ad-hoc personal area networks, Infrared is more appropriate for establishing high-speed point-to-point connections, e.g., for synchronizing personal and corporate data between the mobile device and desktop.

Concluding Remarks on Wireless Network Types

There are at least three interesting observations to be made about the previously described wireless communication alternatives: First, there is no wireless network type that is universally superior to the others, i.e., each has advantages and disadvantages w.r.t. the others. Due to the different design and usage scenarios of the various existing wireless network types, the majority of them are complementary to, rather than competitive with, each other. To leverage the advantages of each individual network type, it is desirable to seamlessly incorporate them into a hybrid network that allows data to flow transparently across the individual network boundaries, using many types of media, be it satellite, wireless, or terrestrial. Mobile users should be able to select their favorite network type based on availability, QoS specifications, and user-defined choices. Currently, the issue of supporting global roaming across multiple wireless and mobile networks is one of the most challenging problems faced by the developers of 4G network technology, whose deployment is not expected until 2006 or even later.

Network Type   Technology         Coverage Range    Downstream Bandwidth        Upstream Bandwidth          RTT
WWAN           GPRS               up to 120 km      10–40 Kbps                  10–20 Kbps                  500–1000 ms [59]
WWAN           3G (CDMA)          up to 35 km       384 Kbps – 2 Mbps           64 Kbps                     250–340 ms [59]
WWAN           Satellite (GEO)    10000–15000 km    10 Mbps (iPSTAR [37])       2 Mbps (iPSTAR [37])        540–600 ms
WWAN           Satellite (LEO)    160 km            64 Mbps (Teledesic [37])    2 Mbps (Teledesic [37])     40–50 ms
WMAN           MMDS               up to 50 km       10 Mbps                     256 Kbps                    10–20 ms
WMAN           LMDS               3–5 km            20–50 Mbps                  10 Mbps                     10–20 ms
WMAN           802.16 (Wi-Max)    3–5 km            70 Mbps                     25 Mbps                     10–20 ms
WLAN           802.11a            5–30 meters       54 Mbps                     54 Mbps                     10–20 ms
WLAN           802.11b            20–100 meters     11 Mbps                     11 Mbps                     10–20 ms
WPAN           Bluetooth          10 meters         723.2 Kbps (async. mode)    57.6 Kbps (async. mode)     ~10 ms [43]
WPAN           IrDA               up to 1 meter     16 Mbps                     16 Mbps                     10–20 ms [43]

Table 2.1: Various characteristics of current and emerging wireless network technologies.

Second, the majority of wireless networks exhibit asymmetry in their network characteristics, i.e., the network load and service characteristics in one direction are quite different from those in the opposite direction. Wireless communication asymmetry can take various forms and can be classified as follows: (a) bandwidth asymmetry, (b) data volume asymmetry, (c) media access asymmetry, and (d) packet loss asymmetry [3, 17, 18]. Bandwidth asymmetry is the most obvious form of asymmetry and is characterized by a difference in the bandwidth capacity of the uplink and downlink channels. Bandwidth asymmetry ratios between the downstream and upstream paths vary significantly between and even within network types, ranging from about 3:1 to 6:1 for cellular packet radio networks up to 640:1 for digital satellite TV networks

(e.g., AirTV) [12, 165]. Bandwidth asymmetry occurs because of the way available radio resources are allocated to the uplink and downlink channels. There are technological, economical, and usage-related reasons for the increasing popularity of asymmetric networks. Asymmetric wireless networks are often constructed because of the expense of the equipment and the high transmitter power required to provide a high-bandwidth uplink channel. As will be noted below, many mobile applications have asymmetric communication requirements and thus, the deployment of bandwidth-asymmetric networks is highly desirable. Data volume asymmetry arises due to the divergence in the amount of data transmitted in the uplink and downlink directions. Unlike full-duplex conversational voice communication, where the traffic volumes of the uplink and downlink are usually similar, many mobile data communication applications, such as web browsing, streaming live video, or file transfer, place higher demands on the downlink than on the uplink communication capacity. In such applications, short data requests on the order of several tens of bytes are transmitted through the uplink, whereas much larger data files on the order of several tens or hundreds of KB are transmitted in the opposite direction. This fact, along with the anticipated dominance of wireless data services in the future, has encouraged network providers to deploy asymmetric communication networks, which, in turn, prevents the bandwidth waste and capacity degradation that would otherwise arise [84]. Media access asymmetry occurs because transmitting data from a central cellular base station or satellite ground station to a collection of geographically distributed mobile clients incurs lower MAC costs than transmitting in the opposite direction. The cause is related to the hub-and-spokes network model underlying the majority of wireless networks. In this model, a central coordinating entity (e.g., a base station) has complete knowledge and control over the downlink channel(s) and hence suffers a lower MAC overhead than the mobile hosts that compete for the uplink. Packet loss asymmetry takes place when the wireless network is significantly more lossy in one direction than in the other. In wireless networks, the uplink path is significantly more error-prone than the downlink path since mobile hosts transmit at much lower power and need to contend for uplink channel slots, whereas high-powered base stations can transmit relatively loss-free.
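The bandwidth asymmetry ratios quoted above follow directly from the figures in Table 2.1. As a small illustration (a sketch only; the network names and Kbps figures are taken from the table, and `asymmetry_ratio` is a hypothetical helper), the following Python snippet computes the downstream-to-upstream ratio for a few of the listed technologies:

```python
# Downstream/upstream figures (in Kbps) for a few technologies from Table 2.1.
NETWORKS = {
    "GPRS (best case)":       (40.0, 20.0),
    "3G (CDMA)":              (2000.0, 64.0),
    "Satellite (GEO, iPSTAR)": (10000.0, 2000.0),
    "Bluetooth (ACL, asym.)": (723.2, 57.6),
}

def asymmetry_ratio(down_kbps: float, up_kbps: float) -> float:
    """Ratio of downstream to upstream channel capacity."""
    return down_kbps / up_kbps

for name, (down, up) in NETWORKS.items():
    print(f"{name}: {asymmetry_ratio(down, up):.1f}:1")
```

Running this shows how strongly the ratio varies between network types, from roughly 2:1 for GPRS up to more than 30:1 for 3G.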

Third, some of these network types are broadcast-based (e.g., Direct Broadcast Satellite (DBS), LMDS, MMDS), which means that they inherently support information dissemination to many users, possibly spread over a large geographic area, over a shared radio link and without any intermediate switching. Data broadcasting or data dissemination is an attractive data delivery model for a number of newly emerging as well as traditional applications such as stock market and sports tickers, news delivery, traffic information and guidance systems, video and audio entertainment, emergency response and battlefield applications, etc. Compared to unicasting, broadcasting has three major benefits if used for applications that require data to be transmitted to many destinations:

(a) First, it is more bandwidth- and energy-efficient since requests for the same data object by multiple clients can be satisfied simultaneously by the server with a single transmission rather than with multiple point-to-point transmissions. (b) Second, it reduces the load on both the broadcast server and the mobile clients, as locally missing data objects that are scheduled for broadcasting do not need to be requested from the server but can rather be downloaded from the broadcast stream as they pass by. Obviously, this also improves the scalability of the system, as the uplink channels as well as the server can support more clients before becoming a system bottleneck. (c) Last, but not least, broadcast data delivery provides a convenient and cost-efficient way to mitigate the effects of voluntary and unplanned disconnections and to enforce cache and transaction consistency.
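Benefit (a) can be made concrete with a small sketch. Assuming a hypothetical request queue that maps object identifiers to the number of clients waiting for each object, the following Python fragment contrasts the number of server transmissions needed under unicast and broadcast delivery:

```python
def transmissions_needed(pending_requests: dict, broadcast: bool) -> int:
    """Number of server transmissions to satisfy all pending requests.

    pending_requests maps object id -> number of clients waiting for it.
    With unicast, every client request costs one transmission; with
    broadcast, one transmission per distinct object satisfies all of
    that object's waiting clients at once.
    """
    if broadcast:
        return len(pending_requests)          # one send per distinct object
    return sum(pending_requests.values())     # one send per client request

queue = {"x": 40, "y": 25, "z": 1}            # skewed access pattern
print(transmissions_needed(queue, broadcast=False))  # 66 unicast sends
print(transmissions_needed(queue, broadcast=True))   # 3 broadcast sends
```

The more skewed the access pattern, the larger the savings, which is exactly why broadcasting pays off for popular data.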

Driven by the numerous advantages of data broadcasting for one-to-many applications and its growing popularity among users, we choose periodic data broadcasting as the main data delivery model for this thesis and, in the next chapter, study various technological aspects centered around the objective of building efficient data dissemination systems. Before doing so, however, we briefly enumerate the various limitations of mobile computing systems and discuss their impact on the objective of efficiently providing data consistency and currency guarantees to mobile users.

2.3 Limitations of Mobile Computing and their Impact on Mobile Transaction Processing

As the technological trends given in the introductory chapter indicate, and as everyone actively participating in today's public life experiences, wireless media play an increasingly important role in everybody's computing environment. However, providing reliable and efficient data communication and processing support to mobile users is more challenging than for users connected to a fixed network such as a wireline local area network (LAN), due to the various constraints placed upon mobile hosts. Environmental and system-immanent constraints of mobile systems are scarce local resources (e.g., processors of today's PDAs run at speeds of 16 to 400 MHz and their memory capacity ranges from 2 to 128 MB [19]), high packet loss rates, large transmission latency, frequent bandwidth changes, variability of the supporting network infrastructure, poor data security, etc. Most of these will not be eliminated by technological progress in the near future. As a consequence, a plethora of new research challenges is generated for the mobile computing community [13, 45, 73, 74, 125]. The provision of efficient access to consistent and current data, being the main theme of this thesis, is one of them. The task of providing efficient transactional database support to mobile users differs from that in stationary systems due to the following non-exclusive list of facts:

• Mobile computers do not enjoy the continuous connectivity provided by, e.g., wireline LANs. A mobile user may voluntarily (to save battery power or to reduce network service costs) or involuntarily (due to poor microwave signal reception) disconnect from the mobile computing network for an arbitrary amount of time. If users wish to keep transactions active even though being disconnected from the network, transactions may become extremely long-lived and their operations are more likely to conflict with other concurrently active transactions.

• To ensure portability, mobile devices are battery-powered and built light and small in size, making them much more resource-poor than, e.g., desktop devices. As a consequence, mobile hosts provide either no or only restricted storage capacity to replicate interesting subsets of the database and thus, the majority of the required data needs to be either directly requested from the server or alternatively downloaded from the air-cache [146] or broadcast channel, provided that frequently requested database objects or even a complete database snapshot are disseminated to the client population. Clearly, if large portions of data need to be obtained through the wireless network, transaction execution times are prolonged, ultimately increasing the probability that transactions cannot be successfully reconciled into a common database state.

As the enumeration above indicates, the constraints inherent to mobile computing systems affect transaction processing, among other things, in that data conflicts due to multiple concurrent read-write transactions operating on the same shared data may occur more frequently than in fixed distributed database systems. Consequently, conflict avoidance/reduction, detection, and resolution are major research issues in mobile transaction processing, and various measures intended to cope with the problem are briefly discussed below.

2.3.1 Techniques to Avoid or Reduce Reconciliation Conflicts

As techniques that completely avoid data conflicts, such as processing transactions in a sequential manner or forcing application users and their invoked transactions to operate solely on privately owned, non-shared fragments of the database, are undesirable from a performance and data accessibility perspective, the essential issue to address here is to investigate and evaluate strategies that reduce the number of data conflicts naturally arising in the presence of concurrent accesses and updates to widely shared data objects.

Reducing the probability of data conflicts by improving transactions’ responsiveness

Viable options to reduce the number of conflict situations are manifold and should be exploited as much as possible by mobile system designers. Quite a few measures center around the goal of optimally utilizing scarce mobile networking and computing resources; they help to reduce transaction response times and, as a side effect, may diminish the data conflict rate, defined as the ratio of the number of conflicting to non-conflicting data operations issued by concurrent transactions.

As a matter of fact, irrespective of the network technology used, the bandwidth provided by wireless networks is relatively low in comparison to fixed networks, and the deliverable bandwidth per user is even lower if it needs to be shared among users, as is the case in today's cellular networks (e.g., GSM (Europe), PDC (Japan), and IS-95 (US)) around the globe.

Data broadcasting is an energy- and time-efficient approach to compensate, at least partially, for the prevalent bandwidth limitations of mobile networks. Rather than allotting dedicated RF channels to mobile users individually and answering their requests in a unicast fashion, data broadcasting takes advantage of the fact that data access patterns are typically skewed in nature [68]; therefore, it is likely that multiple requests for the same object are pending in the request queue of the server and can thus be satisfied simultaneously. Consequently, data broadcasting is more bandwidth-efficient than point-to-point communication as more outstanding requests can be served at the same time. Another benefit of data broadcasting is its energy-efficiency, which is maximized when a pure push-based data dissemination approach is applied. Using push-based data communication is more energy-efficient than the traditional pull-based method since transmitting data is about twice as energy-consuming as receiving messages [86, 94, 143]. With pure push-based data delivery, the server provides data transfer without the reception of specific client requests and enables clients to receive required data via the broadcast channel without switching into the expensive transmit mode. Note that data broadcasting is also more energy-efficient than unicasting if embedded into a hybrid or even pull-based data delivery system, since clients always have the opportunity to search the broadcast channel for a non-cache-resident object first before requesting it from the server. Finally, data broadcasting may help mobile clients that run data-intensive applications to improve their responsiveness. As indicated before, the wireless broadcast channel can be treated as a global air-cache between the server and the client which extends the traditional memory hierarchy structure of the wired client-server environment [3].
As such, it can be used to reduce the average data access costs required to fetch non-cached data objects from the server. Note, however, that the instantaneous cost of retrieving an object from the air-cache is variable and depends on the respective object's position in the broadcast cycle. Consequently, the strategy of waiting until the requested object appears on the broadcast may not always be optimal, since fetching the data directly from the server via some dedicated radio channel may be cheaper, especially if the network and broadcast server load is low and a large number of data objects are scheduled to be broadcast.
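The trade-off just described (waiting for the object to reappear on the air-cache versus pulling it over a dedicated channel) can be sketched as a simple cost comparison. The function name and all numeric parameters below are hypothetical placeholders, not values from any concrete system:

```python
def fetch_from_broadcast(slots_until_object: int, slot_time_ms: float,
                         server_latency_ms: float) -> bool:
    """Decide whether to wait for an object on the broadcast channel.

    slots_until_object: broadcast slots until the object reappears
    slot_time_ms:       time needed to transmit one slot
    server_latency_ms:  estimated round-trip cost of a dedicated pull,
                        including the current server and network load
    Returns True if waiting for the broadcast is expected to be cheaper.
    """
    expected_wait_ms = slots_until_object * slot_time_ms
    return expected_wait_ms <= server_latency_ms

# Object due soon on the air-cache: wait for it.
print(fetch_from_broadcast(5, 20.0, 400.0))    # True  (100 ms vs. 400 ms)
# Object just missed in a long cycle: pull it from the server instead.
print(fetch_from_broadcast(900, 20.0, 400.0))  # False (18 s vs. 400 ms)
```

A real client would of course have to estimate `server_latency_ms` dynamically, e.g., from recent pull response times.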

Two other closely interrelated approaches to speed up transaction execution are to prefetch data objects in anticipation of their access in the near future and to apply a judicious cache replacement strategy that always chooses the data object with the lowest caching utility as replacement victim. Designing such policies is a non-trivial task and involves the consideration of many factors, ranging from the recent history of client data references to less obvious issues such as the structure and contents of the broadcast program or the caching and versioning policy of the server.
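As a rough illustration of utility-based victim selection (a simplified sketch, not the concrete policy presented in Chapter 5), the snippet below scores each cached object by the product of its estimated access probability and its re-acquisition cost, e.g., its distance in the broadcast cycle, and evicts the object with the lowest score; all names and numbers are hypothetical:

```python
def choose_victim(cache: dict) -> str:
    """Return the id of the cached object with the lowest caching utility.

    cache maps object id -> (access_probability, reacquisition_cost),
    where reacquisition_cost approximates the delay of re-fetching the
    object, e.g., its distance in the broadcast cycle.  Objects that are
    rarely used and cheap to re-obtain have the lowest utility.
    """
    return min(cache, key=lambda oid: cache[oid][0] * cache[oid][1])

cache = {
    "a": (0.50, 300.0),   # hot and expensive to re-fetch: keep
    "b": (0.05, 900.0),   # cold, but costly to re-obtain
    "c": (0.10, 40.0),    # lukewarm, but about to reappear on the air
}
print(choose_victim(cache))   # 'c' (utility 4.0 < 45.0 < 150.0)
```

Note how the broadcast program influences the decision: object "c" is evicted not because it is the least popular, but because it can be re-acquired cheaply from the air-cache.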

For a more in-depth discussion on this topic as well as the presentation of a concrete cache replace- ment and prefetching policy that is particularly suitable for transaction-oriented data dissemination applications, we refer to Chapter 5.

Reducing the probability of data conflicts by deploying fine-granularity CC

A slightly different, but equally important conflict-reducing strategy is to apply CC at a fine rather than a coarse granularity level. Enforcing this approach avoids, or at least diminishes, the number of false or permissible data conflicts otherwise detected by the client transaction manager or the reconciler at the server and, therefore, contributes to the performance improvement of the system. For an illustrative example showing how false conflicts may occur, we refer the reader to the previously presented Example 2 on page 4.

Reducing the probability of data conflicts by keeping multiple data versions

Another effective method to reduce the number of conflicts among concurrently executing transactions is to use versions for CC. Multi-versioning helps the scheduler to increase its scheduling power in the sense that it is able to produce more correct, i.e., conflict serializable, histories than a mono-version scheduler [161] and, therefore, it appears to be a promising technique for synchronizing mobile distributed transactions. In the standard mono-version transaction model, read operations are always directed to the most recent version of a requested object, thereby limiting the scheduling power of the transaction manager, as transactions may sometimes need to read an older version of a given object in order for the resulting schedule to remain conflict serializable. Clearly, greater concurrency due to potentially fewer data conflicts among transactions does not come for free: additional overhead in terms of space and processing power is required to benefit from this improvement. However, since multiple versions must be supported in a mobile distributed data communication environment anyway, and as MVCC protocols empirically outperform their mono-version counterparts (see Chapters 4 and 6 for comparison), it is highly desirable to take advantage of them by designing appropriate protocols. For a representative example demonstrating how multi-versioning can be used to increase the concurrency in the system, we refer the reader again to a previously discussed example, namely Example 1 shown on page 4.
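The idea of serving a reader an older version can be illustrated with a minimal version store. This is a simplified, snapshot-style sketch (the class and its interface are invented for illustration, not a protocol from this thesis): a read returns the newest version written at or before the reading transaction's start timestamp, so a concurrent later write does not invalidate the reader's consistent view:

```python
class VersionStore:
    """Minimal multi-version store: each write adds a timestamped version,
    and a read returns the newest version committed at or before the
    reading transaction's start timestamp."""

    def __init__(self):
        self._versions = {}   # object id -> list of (commit_ts, value)

    def write(self, oid, commit_ts, value):
        self._versions.setdefault(oid, []).append((commit_ts, value))
        self._versions[oid].sort()

    def read(self, oid, start_ts):
        # Keep only versions visible to a transaction that started at start_ts.
        visible = [v for ts, v in self._versions.get(oid, []) if ts <= start_ts]
        return visible[-1] if visible else None

store = VersionStore()
store.write("x", 10, "x0")
store.write("x", 30, "x1")    # concurrent update after the reader started
print(store.read("x", 20))    # reader with start timestamp 20 still sees 'x0'
print(store.read("x", 40))    # a later reader sees 'x1'
```

A mono-version scheduler would have to abort or block one of the two transactions in this scenario; the version store resolves the conflict simply by directing the read to the older version.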

Reducing the probability of data conflicts by exploiting semantic information

Another complementary approach to reduce the number of performance-reducing data conflicts is to implement CC on a more elaborate transactional model than the well-established read/write model, in which synchronization is based on the analysis of read and write operations at system runtime. The limitation of the read/write model is that it does not incorporate any semantic information about higher-level operations into decision making and, therefore, fails to exploit a significant amount of concurrency, since it forbids certain execution orders of concurrent transactions that are not conflict serializable but nevertheless leave the database in a consistent state. A fundamental disadvantage of semantics-based CC models, however, is that they significantly complicate application programming (as, e.g., commutativity tables of higher-level operations need to be specified, or counters associated with upper and lower bounds need to be maintained in order to facilitate protocols such as Escrow locking [116]) and they are inherently error-prone. Note that an extended discussion of exploiting semantic knowledge to increase concurrency in the context of mobile transaction processing is provided in Section 6.4.3.

2.3.2 Techniques to Detect and Resolve Reconciliation Conflicts

To deal with intermittent connectivity, to overcome latency and bandwidth problems, and to reduce server load during service peaks, mobile devices are allowed to replicate subsets or all of the server's primary database and to operate on those secondary data copies in an unsynchronized fashion. To prevent local database copies from diverging significantly from the server's version, mobile clients are regularly required to integrate their uncoordinated update operations into the common database state. However, depending on the length of time since the client's last reconciliation with the server and the update frequency of the database during that period, locally generated updates may conflict with other concurrently performed updates in the system and, therefore, some form of conflict detection and resolution is required. The issue of detecting and resolving inconsistencies between database copies can be addressed by using a syntactic or a semantic approach — or a hybrid of both [39].

Syntactic reconciliation approaches enforce scheduling correctness by inspecting the read and write sets of the client transactions executed since the last reconciliation. This approach completely ignores the semantics of the transactional operations as well as the semantics, structure, and usage of the data objects themselves when reasoning about the correctness of transaction interleavings.

Obviously, reconciling divergent database states by analyzing only read and write operations fails to accept some legitimate schedules and thus may unnecessarily resolve false conflicts.

Semantic reconciliation approaches offset the disadvantage of the syntactic methods by using semantic information about the operations and the objects for automatically integrating divergent database copies. Having higher-level, application-specific knowledge about the constituent client transactions allows the resolver to use a correctness criterion weaker than Conflict Serializability [120] for preserving database consistency, which we call semantic correctness following Bernstein et al. [24, 25]. Informally speaking, under this criterion a reconciled schedule is semantically correct if the effect on the database of the interleaved execution of the set of integrated transactions is the same as that of a serial schedule of the same transactions. To ensure that only semantically correct schedules are produced by the resolver, semantic reconciliation requires the application programmer to specify conditions under which transaction steps of concurrent transactions are allowed to be interleaved. Certainly, the complexity of analyzing all the different transaction types in the system and specifying conditions for them that they assume to be true before, during, and after their executions places a great burden on the application programmer. Due to this fact and in order to achieve application-independence, we adopt a purely syntactic reconciliation approach in this thesis. In the same vein, it is also important to note that both reconciliation approaches are complementary rather than competitive concepts. Therefore, our approach can be extended to include object and operation semantics to increase the number of successful reconciliations.

“New capabilities emerge just by virtue of having smart people with access to state-of-the-art technology.”

– Robert E. Kahn

Chapter 3

Hybrid Data Delivery

In this chapter, we give reasons why traditional request/response data delivery and its inverse approach, i.e., data push delivery, are inadequate for building large-scale dissemination-based systems, and introduce the concept of hybrid data delivery, the combination of the traditional request/response mechanism with data broadcasting, as a solution. Then, we present possible configuration options of the communication media to facilitate hybrid data delivery and identify the major underlying assumptions of the thesis. In the remainder of the chapter, we review and evaluate the various technological approaches proposed to realize hybrid data delivery systems over a broadcast medium. The key issues addressed here include the organization of the broadcast program, i.e., how much broadcast channel bandwidth should be allocated to each given data object, and the indexing of the broadcast channel(s), i.e., which indexing technique provides the best performance/energy-consumption trade-off.

3.1 Why to Use Hybrid Data Delivery

Within the last two decades, the two- or three-tier client-server architecture has been the prevalent distributed computing model. Within this architecture, the traditional request/response or pull-based data delivery model has been used. However, pure pull-based data delivery is unsuitable for one-to-many data dissemination applications since it suffers from scalability problems due to the limited uplink bandwidth capacity of wireless networks and the existing upper limit on the rate at which the server can serve outstanding requests. Clearly, by following the KIWI approach (“kill it with iron”), the congestion boundary of the server can be successfully increased.

However, taking a hardware-based approach can be uneconomic, especially if the average server load significantly deviates from the worst case load and overload situations occur infrequently.

To overcome the scalability problem of the pull-based technique, push-based data broadcasting or dissemination has been proposed as an alternative data delivery method [6, 30, 164]. Push-based data delivery is a very attractive communication method for information-feed applications such as stock quotes and sports tickers, electronic newsletters, mailing lists, etc., whose success critically depends on the efficient dissemination of large volumes of popular data to a large collection of users. However, as is often the case, the push-based data delivery method has weaknesses too: (a) First, as the server lacks feedback information about the popularity of the objects in the database, it has no means to adjust the broadcast content such that it optimally matches the current data needs of the user population. Consequently, the server may broadcast data objects that nobody requires, it may never deliver data objects that the clients desperately need, and it may transmit less popular objects too often or very popular objects too infrequently. Obviously, all those factors have an adverse effect on system performance and user satisfaction, as precious downlink bandwidth gets wasted and the QoS is reduced. (b) The second shortfall of push-based data delivery relates to the way data is retrieved from the broadcast channel. As with magnetic tapes, data in the broadcast channel is accessed sequentially rather than randomly, i.e., the data access time, defined as the time elapsed from the moment a client issues an object request to the point when the required object is completely downloaded by the client, depends on the amount of data being broadcast. Consequently, the more distinct data objects are broadcast and the larger those data objects are, the higher the average data access latency. If we assume that any data object scheduled for broadcasting is equally likely to be accessed and all data objects are disseminated with the same frequency, the average access time for any object is equal to half the amount of time required to broadcast every member of the broadcast set once. As a consequence, objects to be broadcast need to be selected judiciously such that the average access time does not become unsustainably long.
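The access-time claim for such a flat broadcast program is easy to quantify: with N equally likely, equally frequent objects, a request arriving at a uniformly random point of the cycle waits on average half a cycle. A minimal sketch (the object count and per-object transmission time are hypothetical, and the download time of the object itself is ignored for simplicity):

```python
def avg_access_time(num_objects: int, object_time: float) -> float:
    """Expected access time on a flat broadcast program.

    All num_objects are broadcast exactly once per cycle and are equally
    likely to be requested; a request arrives at a uniformly random point
    of the cycle, so the expected wait is half the cycle length.
    """
    cycle = num_objects * object_time
    return cycle / 2.0

# 1000 objects, 50 ms each: a 50 s cycle and a 25 s average access time.
print(avg_access_time(1000, 0.05))   # 25.0
```

The linear growth of the expected wait in the number of broadcast objects is precisely why the broadcast set must be chosen judiciously.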

As the preceding discussion has shown, both data delivery methods have their limitations and thus, taken by themselves, are not appropriate for data dissemination applications. To overcome these limitations, the hybrid data delivery approach has been proposed [6, 38, 112, 113, 147], whose basic idea is to combine the push- and pull-based data delivery methods in a complementary manner by exploiting the advantages of each delivery type while avoiding their shortfalls. To achieve optimal performance results, hybrid data delivery concentrates on broadcasting only popular data objects and unicasts the rest of the database as demand arises. The amount of data that is considered popular and needs to be broadcast depends on various system and workload parameters such as the ratio of uplink to downlink bandwidth, the skew in the data access distribution, the request rate and its variation over time, etc. As some of those parameters change over time, often in an unpredictable way, adaptive hybrid data delivery techniques have been proposed [112, 146, 147] that retain the advantages of hybrid data delivery but also adapt to changes in the workload and client behavior. However, the modeling of an adaptive hybrid data delivery system is beyond the scope of this thesis and we refer the interested reader to the literature listed above for technical details.
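The basic hybrid split can be sketched as follows, assuming the server has popularity estimates for each object; the function, the fixed broadcast fraction, and the probabilities are illustrative only:

```python
def split_database(access_prob: dict, broadcast_fraction: float):
    """Partition object ids into a broadcast (push) set and an
    on-demand (pull) set.

    access_prob maps object id -> estimated access probability; the
    most popular broadcast_fraction of the objects is pushed on the
    broadcast channel, the remainder is served via point-to-point
    channels on request.
    """
    ranked = sorted(access_prob, key=access_prob.get, reverse=True)
    cut = max(1, int(len(ranked) * broadcast_fraction))
    return set(ranked[:cut]), set(ranked[cut:])

probs = {"a": 0.40, "b": 0.30, "c": 0.15, "d": 0.10, "e": 0.05}
push, pull = split_database(probs, 0.2)   # broadcast the hottest 20%
print(push)   # {'a'}
print(pull)   # the remaining objects are served on demand
```

An adaptive system would additionally re-rank the objects and move the cutoff as the workload parameters mentioned above drift over time.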

3.2 Hybrid Data Delivery Networks

The wireless network infrastructure facilitating hybrid data delivery is called a hybrid data delivery network. From the logical view, a hybrid network is a communication architecture that allows its network members to establish point-to-point connections with each other and also enables them (e.g., the server) to broadcast information to all other active parties of the network via a broadcast channel. From the physical perspective, hybrid networks may be established on the basis of:

1. An all-in-one communication medium. In this configuration, both point-to-point communication and data broadcasting are carried out over the same physical medium. A two-way satellite system, which typically consists of asymmetric satellite paths — a broadband satellite path in the downstream direction (server-to-user) for the delivery of the actual broadcast contents and a narrowband upstream path (user-to-server) for the carriage of the user requests — would be a network architecture belonging to this category.

2. A separate point-to-point and broadcast medium. In this network setup, the physical communication medium used to transfer user requests to the server differs from that used for information dissemination. An example network system would be a hybrid satellite system that consists of multiple two-way terrestrial point-to-point channels (e.g., Public Switched Telephone Network (PSTN) dial-up, Integrated Services Digital Network (ISDN) dial-up, or leased-line connections) to allow users to request and receive data objects that are not scheduled for broadcasting, and a broadband satellite channel which is used by the server to disseminate popular data.

3. A separate upstream and downstream medium. In this network topology, the point-to-point medium is used exclusively to facilitate the upstream data transfer to the server, while the downstream medium is used for both data broadcasting and unicasting. In this network type, the available downstream bandwidth is logically divided into two parts: a commonly shared broadcast channel to facilitate data dissemination and multiple dedicated point-to-point channels to respond to user requests. An exemplary network architecture would again be a hybrid satellite system that comprises a one-way terrestrial communication path to the server and a logically divided downstream satellite channel as described above.

To be able to abstract from issues such as which objects to select for broadcasting and how frequently to disseminate them as well as how much downstream bandwidth is to be dedicated to broadcasting and unicasting, we make the following assumptions within this thesis:

1. We opt for the separate broadcast and unicast media network model as the data communication model of choice since the broadcast and point-to-point channels are physically independent and a fixed amount of bandwidth is assigned to each of them. By doing so, we do not face the problem of dynamically calculating the optimal bandwidth allocation between the broadcast and point-to-point channels.

2. To exempt ourselves from the non-trivial issue of deciding which objects to broadcast and how often, we assume that the data access pattern follows the well-known 80:20 distribution, i.e., 80% of the data requests are directed to 20% of the data objects, and that broadcasting the most popular 20% of the data objects yields close-to-optimal request response times.

3. We assume a static client access behavior, i.e., we are not required to continuously adjust the broadcast contents to react to changes in the client request pattern.

4. Last, but not least, we assume reliable communication connections and, therefore, do not consider the effects of transmission errors.

In order to successfully deploy a hybrid data delivery system, a number of performance-critical and other crucial design issues need to be addressed, including: How to enable clients to synchronize with the channel and to interpret the data currently broadcast? How to organize the broadcast channel such that the average data access time can be minimized and the requested data can be filtered from the channel in an energy-efficient way? The answers to those questions will be provided in the remainder of this chapter.

3.2.1 Organizing the Broadcast Program

Broadcast program or structure refers to the content and organization of the broadcast and is one of the most fundamental issues in data dissemination since it decides what, in which order, and when to transmit from the server to the clients and thus, has a significant impact on the overall system performance. The ultimate goal is to minimize the data access latency while keeping the power consumption incurred by tuning to the broadcast channel at (almost) the minimum. However, before we elaborate on specific broadcast structures, we discuss the general building blocks of the conceptual design of the broadcast program.

The smallest logical unit of the broadcast program is called a bucket or frame [103], which physically consists of a fixed number of packets, the basic unit for transmitting information over the network. We distinguish between three different types of buckets: (a) concurrency control report (CCR) buckets, (b) index buckets, and (c) data buckets. As the name implies, CCR buckets contain CC-related information which enables mobile clients to continuously pre-validate on-going read-only and read-write transactions and to (pre-)commit transactions as they have completed their executions. Additionally, CCRs help mobile clients to maintain cache consistency without continuously monitoring the broadcast. Index buckets provide mobile users with information about the arrival times of data objects on the broadcast channel. By accessing the index, mobile clients are able to predict the point of time when their desired objects appear in the channel. Thus, they can stay in the energy-efficient doze mode while waiting and tune into the broadcast channel only when the data object of interest arrives. Note that a typical wireless PC card like ORiNOCO consumes 60 mW during the doze mode and 805−1,400 mW during the active mode [154]; thus, air indexing can facilitate considerable energy conservation. Data buckets store a number of data objects, and each object is identified by a unique object identifier (OID), which is independent of the values of its attributes. The OID can be used as a search key to find and identify an object in the broadcast channel. For reasons of practicability and cost efficiency, multiple buckets of the same type are placed together in the broadcast program, and we refer to such a set of contiguous buckets as a segment. Consequently, the broadcast program is made up of the following three segment types: (a) CCR segments, (b) index segments, and (c) data segments.
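The quoted power figures make the benefit of doze mode easy to quantify. The following back-of-the-envelope sketch is an illustration, not part of the thesis; the 1,000 mW active-mode figure is an assumed mid-range value within the 805−1,400 mW interval quoted above. It compares an indexed access, where the client dozes for most of the access latency, with continuously monitoring the channel:

```python
# Energy drawn while satisfying one request, using the ORiNOCO figures
# quoted above: 60 mW in doze mode, ~1,000 mW (assumed) in active mode.
P_DOZE, P_ACTIVE = 0.060, 1.000   # watts

def request_energy(adat_s, att_s):
    """Energy in joules for one air-cache access: the client is in active
    mode for the tuning time ATT and dozes for the rest of ADAT."""
    return P_ACTIVE * att_s + P_DOZE * (adat_s - att_s)

# 10 s access latency: indexed access (0.5 s tuning) vs. monitoring throughout
print(request_energy(10, 0.5), request_energy(10, 10))  # ~1.07 J vs 10.0 J
```

The roughly order-of-magnitude gap matches the claim in the text that selective tuning yields considerable energy conservation.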

In order to allow mobile clients to interpret the data instantly as they fly by and to synchronize with the broadcast channel at any time, buckets are designed to be self-explanatory by including the following header information in each bucket: (a) the bucket ID (BID), (b) the bucket type, (c) the offset to the next index segment, and (d) the offset to the next major broadcast cycle (see Definition 5 below). To facilitate the subsequent discussion and to provide plausible motivations for the various design choices of the broadcast program, we provide the following definitions:

Definition 1 (Air-Cache Data Access Time (ADAT)). ADAT is the duration of time starting from the moment when the client wants to fetch an object from the broadcast channel or air-cache to the point when the desired object is actually downloaded into the local cache. ADAT can be logically split into the air-cache index probe time and air-cache wait time that are defined next.

Definition 2 (Air-Cache Index Probe Time (AIPT)). AIPT is the period of time between probing the broadcast channel for information on the broadcast time of the next index segment and getting to the next index segment in order to obtain information on the position of the requested object in the broadcast program. 

Definition 3 (Air-Cache Wait Time (AWT)). AWT refers to the time interval between inspecting the first index bucket of the index segment and downloading the requested object from the broadcast channel. 

Definition 4 (Air-Cache Tuning Time (ATT)). ATT is the portion of ADAT that the client spends in active mode, listening to the broadcast channel to find the position of the required object and to download it. ATT is proportional to the power consumption of the mobile client.

Definition 5 (Major Broadcast Cycle (MBC)). MBC is the amount of time it takes to transmit all data objects scheduled for broadcasting at least once. 

Definition 6 (Minor Broadcast Cycle (MIBC)). An MBC may be further partitioned into a num- ber of MIBCs. Each MIBC contains a sequence of objects with non-decreasing values in the OIDs, begins with a CCR segment, and is likely to be followed by an index segment. 

In the literature, a plethora of broadcast structures have been proposed [4, 75, 77, 112] that can be distinguished along two dimensions: (a) flat vs. skewed organization and (b) single vs. multiple channel organization. A flat broadcast organization is the simplest way to generate a broadcast program and is characterized by the fact that each data object appears exactly once per MBC.

An example structure of a flat broadcast is shown in Figure 3.1(a) on page 48. Its underlying assumptions are that the data file disseminated by the broadcast server consists of 18 data objects of equal size and access probability and that each bucket in the broadcast accommodates exactly one of these objects. Additionally, we assume that one index bucket can capture indexing information of 6 data objects and that the entire index — rather than only a portion of it — is broadcast between successive data segments. For pedagogical and comparison reasons, we retain these assumptions for the presentation of the other broadcast structures discussed in the rest of the subsection and may refine them as demand arises. The design of the structure illustrated in Figure 3.1(a) is driven by the objective to provide clients with the ability to selectively tune into the broadcast channel (by interleaving an index with the data) and to allow them to efficiently validate their local transactions and to enforce cache coherence in an energy-conserving way (by interspersing a CCR with the data). As both the CCR and index segment are broadcast once during an MBC, the average AIPT is equal to half the distance between two consecutive index segments, which is about equal to half the total length of an MBC. The AWT consists of inspecting the index, waiting for the object to occur on the channel, and downloading it.¹ On average, this is equal to the length of the index segment, l_index, plus half the length of the data segment, l_data, plus the time to download the desired object, l_object, i.e., l_index + (1/2)·l_data + l_object. As a result, we can derive that the average ADAT is about equal to the length of one MBC, which may be unsustainably high, especially if the MBC becomes large.

To avoid such high data access latencies, an alternative method is to broadcast the index segment, possibly coupled with the CCR segment, multiple (i.e., d_index) times within an MBC, as shown in Figure 3.1(b). Then, the average AIPT is only half of the sum of the length of the CCR segment, l_CCR, the length of the index segment, l_index, and the quotient of l_data and d_index, i.e., (1/2) · (l_CCR + l_index + l_data/d_index). Obviously, the more index segments are interleaved with the data segment, the shorter the average AIPT. However, as a side effect of interspersing multiple index segments (and possibly CCR segments) into the broadcast program, the size of the MBC increases, which, in turn, translates into a longer average AWT.

Therefore, the essential issue is to find an optimal value of d_index so as to minimize the average ADAT, which is achieved by using the following formula [77]:

d_index = √( l_data / (l_index + l_CCR) ).        (3.1)
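The trade-off can be sketched numerically. The following illustration (the bucket counts are assumed, purely illustrative values) evaluates Equation (3.1) and the average AIPT formula above; note that increasing d_index keeps shrinking the AIPT, and it is the simultaneous growth of the AWT that makes the value from Equation (3.1) the ADAT-optimal choice:

```python
import math

def optimal_d_index(l_data, l_index, l_ccr):
    """Optimal number of index/CCR repetitions per MBC, Equation (3.1)."""
    return math.sqrt(l_data / (l_index + l_ccr))

def avg_aipt(l_data, l_index, l_ccr, d_index):
    """Average air-cache index probe time: (1/2)(l_CCR + l_index + l_data/d_index)."""
    return 0.5 * (l_ccr + l_index + l_data / d_index)

# Assumed bucket counts: 1800 data buckets, 30 index buckets, 3 CCR buckets.
l_data, l_index, l_ccr = 1800, 30, 3
d_opt = optimal_d_index(l_data, l_index, l_ccr)
print(round(d_opt))                                       # -> 7
for d in (1, round(d_opt), 15):
    print(d, round(avg_aipt(l_data, l_index, l_ccr, d), 1))
```

With a single index per MBC the average AIPT is 916.5 buckets; with the seven repetitions suggested by Equation (3.1) it drops to about 145 buckets.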

Both broadcast organizations described above are flat broadcast schemes and are only optimal if every data object is accessed uniformly, which is rarely the case in real life. Typically, the probability distribution of client accesses is highly skewed [68], i.e., some data objects are more important and are more commonly requested by the clients than others. Skewed broadcast schemes [4, 7, 60, 148, 153] take account of this fact and broadcast more popular data objects more frequently, which results in broadcast programs in which some objects may appear more than once within an MBC.

¹Note that we assume that all object requests refer to data objects included in the broadcast program.

Acharya et al. [4, 7] were the first to propose a non-uniform data broadcast, called the “multi-disk” broadcast generator. The multi-disk broadcast generator splits the data file into n partitions, where each partition consists of objects with the same or similar access probability. Each partition is then assigned to a separate broadcast disk i which spins with its own relative frequency λ_i. The spinning speeds of the individual disks are set in proportion to the average access probability of the objects within the various partitions, and the partitions' sizes are chosen such that each partition can be split into equally sized chunks. Now let λ denote the least common multiple of the λ_i, for all i. The multi-disk broadcast generator additionally splits the contents of each broadcast disk i into c_i chunks, where c_i = λ/λ_i. The broadcast program is then built by interleaving chunks from the various broadcast disks by using the following algorithm:

1  begin
2    for i ← 0 to λ − 1 do
3      for j ← 1 to n do
4        broadcast chunk (i mod c_j) of broadcast disk j
5  end

Algorithm 3.1: Multi-disk broadcast generation algorithm.
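Algorithm 3.1 can be sketched directly in executable form. The helper below is an illustration, not the thesis' implementation; it generates one major broadcast cycle and reproduces the 6:4:3 repetition frequencies of the running example:

```python
from functools import reduce
from math import lcm

def multi_disk_broadcast(disks, freqs):
    """One MBC produced by the multi-disk generator (Algorithm 3.1).
    disks: list of object lists, one per broadcast disk
    freqs: relative spinning frequencies lambda_i"""
    l = reduce(lcm, freqs)                # lambda = lcm of all lambda_i
    num_chunks = [l // f for f in freqs]  # c_j = lambda / lambda_j
    # split each disk into c_j equally sized chunks
    chunks = []
    for disk, c in zip(disks, num_chunks):
        size = len(disk) // c
        chunks.append([disk[k * size:(k + 1) * size] for k in range(c)])
    program = []
    for i in range(l):                    # for i <- 0 to lambda - 1
        for j in range(len(disks)):       #   for j <- 1 to n
            program.extend(chunks[j][i % num_chunks[j]])
    return program

# Running example: BD1 = objects 1-4, BD2 = 5-10, BD3 = 11-18
bd1, bd2, bd3 = list(range(1, 5)), list(range(5, 11)), list(range(11, 19))
mbc = multi_disk_broadcast([bd1, bd2, bd3], [6, 4, 3])
print(mbc.count(1), mbc.count(5), mbc.count(11))  # -> 6 4 3
```

With λ = lcm(6, 4, 3) = 12, the disks split into 2, 3, and 4 chunks of 2 objects each, and every object of BD1, BD2, and BD3 appears 6, 4, and 3 times per MBC, respectively.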

To exemplify the multi-disk data broadcast approach within the framework of our running example, we need to modify its underlying assumption that all objects are equally likely to be accessed. Obviously, this assumption is not appropriate to generate a skewed broadcast, and we therefore assume a non-uniform data access behavior. More precisely, data objects 1−4, 5−10, and 11−18 are assumed to be accessed with a probability p of 6/13, 4/13, and 3/13, respectively. Given those groups of objects with different access probabilities, the multi-disk broadcast generator creates a 3-disk broadcast program with broadcast disks BD1, BD2, and BD3 consisting of data objects 1−4, 5−10, and 11−18, respectively. To take account of the varying access probabilities among the object groups, objects on BD1 are broadcast one and a half times as often as objects on BD2 and twice as frequently as objects on BD3. Thus, the broadcast generator spins disks BD1, BD2, and BD3 with relative frequencies of λ_1 = 6, λ_2 = 4, and λ_3 = 3. Consequently, the contents of BD1, BD2, and BD3 will be split into 2, 3, and 4 chunks, respectively, and each chunk consists of 2 objects on all disks. Figure 3.1(c) finally shows the resulting multi-disk broadcast program for the running example. The figure also illustrates a valuable property of the multi-disk broadcast algorithm: the inter-arrival time between successive broadcasts of the same object is fixed.

Note that the generation of regular broadcasts is important for mobile clients for reasons such as simplifying the client caching and prefetching heuristics or retrieving data objects from the broadcast channel without (always) consulting the index. However, the regularity property does not come without cost. The problem is that the broadcast frequency and, therefore, the amount of broadcast bandwidth allocated to any data object does not properly reflect its access probability. As derived in [15], in an optimal broadcast program, the amount of bandwidth allocated to any object should be proportional to the square root of its access probability. In our running example and as reflected in the spinning speeds of the 3 broadcast disks, the access probabilities of the objects “stored” on disks BD1, BD2, and BD3 are 6/13, 4/13, and 3/13, respectively. The square-root formula for optimal bandwidth allocation prescribes that disks BD1, BD2, and BD3 should get approximately 40%, 32%, and 28% of the bandwidth, respectively. However, the multi-disk scheduling approach gives the same bandwidth to all 3 disks, i.e., popular data objects (1−4) are given too little bandwidth and non-popular data objects (11−18) are given too much bandwidth. In the literature there exist alternative approaches that optimize the multi-disk approach and provide close-to-optimal bandwidth allocation to data objects [60, 148, 153]. However, those approaches trade off broadcast program regularity for better bandwidth allocation and may therefore not be the first choice of the system designer.
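The square-root rule of [15] is easy to check numerically. The snippet below (an illustration only) normalizes the square roots of the disks' access probabilities into bandwidth shares:

```python
import math

def sqrt_bandwidth_shares(access_probs):
    """Optimal broadcast bandwidth shares: proportional to the square
    root of each partition's access probability (square-root rule [15])."""
    roots = [math.sqrt(p) for p in access_probs]
    total = sum(roots)
    return [r / total for r in roots]

# Running example: disks BD1, BD2, BD3 with access probabilities 6/13, 4/13, 3/13
shares = sqrt_bandwidth_shares([6 / 13, 4 / 13, 3 / 13])
print([round(100 * s) for s in shares])  # -> [40, 32, 28]
```

The multi-disk generator, by contrast, assigns each disk one third of the bandwidth per inner loop iteration of Algorithm 3.1, regardless of these probabilities.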

The broadcast structures discussed so far assumed that the broadcast server disseminates information over a single channel and all clients are tuned to this channel. An alternative approach is to conceive an environment in which the server broadcasts popular data on multiple channels, and clients listen to one or more channels in parallel, depending on the physical properties of the mobile device. In this respect, however, we argue that there is no need to model separate channels for data dissemination as long as the data is accessed uniformly. If that is the case, it does not matter whether we multiplex the index and its underlying data on a single or on multiple channels as long as CC-related and index data is still broadcast with the same frequency. The reason is that the combined capacity of multiple channels is equivalent to the capacity of a single channel and we can always find a mapping from a single-channel broadcast program to a multi-channel broadcast program that provides the same performance results.

This finding, however, may not be true if the access pattern on the data objects is skewed.

Then, multi-channel broadcasting may provide a performance advantage over the single-channel approach. This is because the multi-channel broadcast approach allows the interleaving of multiple small indexes with data objects within each of the multiple channels rather than a single large index for an entire broadcast channel. If we now assume that data objects with the same or similar data access probabilities are broadcast over the same broadcast channel, the indexes are built upon objects with the same (or similar) popularity. Consequently, clients requesting a popular data object no longer waste time inspecting index entries of unpopular data objects. Figure 3.1(d) finally illustrates a suitable multi-channel broadcast program for our running example. As in the previous broadcast organization structures, the program's underlying assumption is that each index bucket can accommodate index information of up to 6 objects; thus, the index segments of broadcast channels 1 and 2, denoted IS1,n and IS2,n, respectively, where n represents some consecutive index segment number with n ≥ 1, contain only one index bucket. Index segment IS3,n is double the size of IS1,n and IS2,n as it needs to accommodate index information of 8 objects.

3.2.2 Indexing the Broadcast Program

So far, we have not discussed specific channel indexing techniques that help mobile clients to efficiently find out what is being broadcast at what time. Before doing so, we briefly discuss reasons why an index should be an integral part of any broadcast program of a hybrid data delivery system: (a) The first reason is related to the key issue of saving scarce battery power of mobile devices. Without an index, mobile clients need to continuously monitor the broadcast channel until the desired data object arrives. This consumes about an order of magnitude more energy (as mobile clients need to remain in active mode all the time) than if an air index were interleaved with the corresponding data buckets and mobile clients could stay in doze mode during the waiting time, tuning into the channel only when the desired object arrives.

[Figure 3.1: Various possible organization structures of the broadcast program — (a) flat single-channel broadcast organization with (1,1) indexing; (b) flat single-channel broadcast organization with (1,2) indexing; (c) skewed single-channel broadcast organization with (1,2) indexing; (d) skewed multi-channel broadcast organization with multi-channel indexing. Each major broadcast cycle interleaves CCR, index, and data segments; in (b) and (c) the MBC is split into two minor broadcast cycles, and in (d) the data is spread over three channels with per-channel index segments IS1,n, IS2,n, and IS3,n.]

(b) Second, air indexing may help mobile clients to reduce their average ADAT. At first glance, this point may seem contradictory, as the broadcast cycle is lengthened by the additional indexing information, clearly leading to longer average

ADATs for those objects which are air-cache-resident. However, in a hybrid data delivery system the air-cache is not the only source available to mobile clients for data retrieval. For performance reasons, only the hottest database objects are continuously broadcast to the client population, and the rest of the database can be requested from the server as demand arises. To exploit the scalability advantages of data broadcasting and to prevent the mobile network and server from becoming the performance bottleneck, mobile clients should always listen to the broadcast first to find out whether the object of interest is air-cached, before sending a request to the server asking for it. To enable clients to quickly differentiate air-cache hits from air-cache misses, an air-cache index is indispensable.

If no index is interleaved with the data objects and the object of interest is not air-cached, then a mobile client needs to wait an entire MBC to find this out. An air-cache index can considerably speed up that process resulting in much shorter average ADATs for non-air-cache-resident objects.

Note, however, that the question whether an air-cache index reduces or increases the overall average ADAT is system-specific and depends on many tuning and workload parameters, such as the repetition frequency of the index, the relative size of the index compared to its corresponding data, the number of data objects being broadcast, the data access patterns of the clients, the average load on the network and database server, etc.

Irrespective of whether indexing bears a performance trade-off or not, it helps to conserve energy and is therefore an indispensable technique for hybrid data delivery systems. In the literature, we can distinguish between three classes of index methods for broadcast channels:

(a) signature-based, (b) hashing-based, and (c) tree-based indexing. In what follows, we briefly describe the basic working principles of these techniques and comparatively evaluate them w.r.t. the two key performance metrics, namely access latency and tuning time.

Signature-based Indexing

The signature method is widely used, with applications in areas such as text retrieval [48], multimedia database systems [130], image database systems [100], and conventional database systems [34]. Signatures are densely encoded information about data objects and are significantly smaller than the actual objects themselves (< 20%). They are easy to calculate and provide a high degree of selectivity. A signature of a data object i, denoted S_i, is a bit vector generated by first hashing each attribute value of the data object into a bit sequence and then superimposing, i.e., ORing, the bit sequences together. Signatures are (periodically) calculated by the broadcast server for every scheduled data object and are typically broadcast as a group, denoted S_bcast, either once or preferably multiple times within an MBC. To determine whether a requested object might be contained in the broadcast program, a query signature S_query is constructed with the same hash function as used by the broadcast server for the objects and is then compared to each S_i in S_bcast. As a result of the comparison, a candidate list of data objects is returned containing those objects of the broadcast program that match the query signature, i.e., S_query ∧ S_i = S_query. Each object signature and the OID of its corresponding object is associated with the information where to find the respective object in the broadcast program and is stored as a triple (S_i, OID, BID) in S_bcast. Once the candidate objects are determined, they must be compared directly with the search criteria in order to eliminate so-called false drops that may occur due to reasons such as hash collisions and/or disjunctively combining various signature terms. An example of a signature file along with a sample query is given below:

Query: Select * From Ticker Where Symbol = 'MSFT';
Query signature S_query: 10100010

  Signature file (S_i, OID, BID)   Data buckets (OID, Symbol, Price)   Comparison result
  00001011   1   1                 1   IBM    86.75                    S_1 vs. S_query: no match
  10101011   2   1                 2   MSFT   27.50                    S_2 vs. S_query: match
  10000011   3   1                 3   ORCL   11.20                    S_3 vs. S_query: no match
  ...                              ...                                 S_4 vs. S_query: false drop
  11110011   N   M                 N   SUNW    4.10

Figure 3.2: Example illustrating the signature comparison process.

In this example the predicate Symbol = ’MSFT’ has a signature of 10100010 and the signature comparison indicates a match for objects 2 and 4 and rejects the other two objects. After accessing and inspecting objects 2 and 4 in the broadcast stream, only object 2 matches the search condition and object 4 is identified as false drop.
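The superimposed-coding mechanics behind this example can be sketched as follows. The two-bits-per-value toy hash is an assumption for illustration and does not reproduce the exact bit patterns of Figure 3.2:

```python
def _bit(value, salt, bits):
    # toy hash for illustration; a real system would use a well-mixed hash
    return 1 << ((sum(ord(c) for c in str(value)) + salt) % bits)

def signature(values, bits=8):
    """Superimposed coding: hash each attribute value to a bit pattern
    and OR (superimpose) the patterns into the object signature S_i."""
    sig = 0
    for v in values:
        sig |= _bit(v, 3, bits) | _bit(v, 7, bits)  # set 2 bits per value
    return sig

def is_candidate(s_query, s_obj):
    """Match test S_query AND S_i = S_query; hits may still be false drops."""
    return s_query & s_obj == s_query

s_query = signature(["MSFT"])
print(is_candidate(s_query, signature(["MSFT", 27.5])))   # True  (candidate)
print(is_candidate(s_query, signature(["IBM", 86.75])))   # False (rejected)
```

A candidate is only a possible match: since attribute bit patterns are OR-ed together, an unrelated object can accidentally cover all query bits, which is exactly the false-drop case that the final direct comparison eliminates.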

Signature schemes proposed in the literature are the (a) simple, (b) integrated, and (c) multi-level signature methods [103]. In the simple signature scheme, a signature bucket is constructed for each data bucket and broadcast before it. The integrated signature scheme generalizes the simple scheme by generating a signature bucket for a group of one or more data buckets. As in the simple scheme, the signature bucket is disseminated before the corresponding data buckets. The multi-level signature scheme consists of multiple signature levels, with each level being broadcast before its corresponding data buckets. The multi-level scheme combines the simple and integrated signature methods by using the former to generate the lowest-level signatures and the latter to calculate upper-level signatures.

Before concluding the discussion of signature-based air indexing, we specify the algorithm a mobile user uses to locate and retrieve data objects matching a query signature S_query within the air-cache if the integrated signature scheme is used for indexing. The algorithm, as shown below, is executed as soon as a required data object is chosen by the mobile user and a local cache miss has occurred.

1  begin
   /* Initial probe */
2    Tune into the broadcast channel, read the header part of the current bucket Bcurr, and find out when the next signature bucket will be broadcast.
3    Go into doze mode and wake up when the signature bucket is broadcast.
   /* Integrated signature probe */
4    foreach integrated signature of an entire broadcast cycle do
5      if Squery matches the integrated signature then
6        Check all data buckets associated with the signature for true signature matches and download them from the air-cache.
7      else
8        Go into doze mode and wait until the next integrated signature is broadcast.
9  end

Algorithm 3.2: Access protocol for retrieving data objects by using the integrated signature scheme.

Hashing-based Indexing

Hashing-based schemes differ from tree-based and signature-based indexing schemes in that they do not require separate indexing information to be broadcast with the data. Rather, hashing parameters are included in the header part of each data bucket. To help mobile users orientate themselves within the broadcast stream and to enable them to determine the position of the desired object in the broadcast program, each data bucket header contains the following information: (a) the BID, (b) the offset to the next MBC, oMBC, (c) the hash function h, and (d) the shift value s. The shift value is a pointer to the logical bucket B containing data objects with the hash key k such that h(k) = bucket_id(B), where bucket_id denotes a function returning the BID of the bucket B.

The shift value is required since hash functions may not be perfect, which means that there will be hash collisions and fractions of the colliding data objects may need to be stored in overflow buckets which immediately follow the actual bucket assigned to them by the hash function.² As a result, and with the exception of the first logical data bucket in the broadcast program, the other logical broadcast buckets might need to be shifted further down in the broadcast cycle, which requires each logical bucket to store redirection information in the form of a shift value eventually guiding the user to the true logical bucket.

After these preliminary remarks, we are in a position to discuss the Hashing A data access protocol as introduced in [76]. The access protocol involves the steps presented in Algorithm 3.3.

To get a better understanding of the Hashing A access protocol, Figure 3.3 illustrates a simple application scenario where a mobile user wants to locate an object x with search key k = 18 in the broadcast channel. The hash function used to map objects to logical data buckets is h(k) = k mod 5.

In Figure 3.3, buckets that have no fill style denote logical buckets and those that are filled with dots denote overflow buckets of the preceding logical bucket. Besides, the numbers in the left and right corners of the bucket headers denote the bucket identifier and shift value, respectively. In the example, we assume that the initial probe takes place at the second physical bucket (BID = 1). The client probes this bucket and reads the bucket identifier and hash function from its bucket header

²Note that grouping all overflow buckets at the end of a broadcast program, with each logical bucket having a pointer to its first bucket of the overflow chain, would be an alternative hashing method yielding comparable tuning and data access times to those of the Hashing A method described below.

1   begin
    /* Initial probe phase starts here */
2     Tune into the broadcast channel, read the header part of the current bucket Bcurr, and calculate h(k).
3     if bucket_id(Bcurr) < h(k) then
4       Go into doze mode and wait until bucket h(k) appears on the channel.
5     else
6       Go into doze mode and wake up at the beginning of the next MBC.
    /* First probe phase commences here */
7     if bucket_id(Bcurr) = h(k) then
8       Read the shift value scurr from the header section of the current bucket and go into doze mode.
9     if scurr > 0 then
10      Wake up after scurr buckets.
11    else
12      Stay tuned to the broadcast channel.
    /* Final probe phase starts here */
13    Listen to the broadcast channel until either an object with search key k is encountered or an object with search key l is observed such that h(l) ≠ h(k).
14  end

Algorithm 3.3: Data access protocol of the Hashing A scheme.

and verifies whether the bucket identifier is smaller than the hash value of the search key 18. Since this condition holds, the client goes into doze mode and wakes up to proceed with the first probe at the fourth physical bucket (BID = 3). If there were no overflow, this bucket would contain the candidate data objects that might match the search key 18. However, since there is overflow, the data objects which are mapped to the logical bucket 4 (i.e., h(k) = 3) are shifted by 7 buckets further down in the broadcast stream. Again, the client goes into doze mode and continues with the

final probe at logical bucket 4 (BID = 10). In order to find out which data objects with search key 18 are actually present in the broadcast program, the client finally needs to examine logical bucket 4 and its associated overflow bucket (i.e., BID = 11) for query matches.
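The probe sequence of this example can be traced with a small sketch. The helper below is hypothetical and deliberately simplified: it hard-codes the shift table of Figure 3.3 and reduces the directory-miss case to a restart at bucket 0 instead of modeling the wait for the next MBC:

```python
# Shift values of the logical buckets 0-4 in the example broadcast
# of Figure 3.3 (BID of logical bucket -> shift value s).
SHIFT = {0: 0, 1: 2, 2: 6, 3: 7, 4: 8}

def hashing_a_probes(k, initial_bid, modulus=5):
    """Bucket IDs at which a client tunes in while locating objects
    with search key k under the Hashing A protocol (Algorithm 3.3)."""
    h = k % modulus                  # h(k) = k mod 5
    probes = [initial_bid]           # initial probe
    if initial_bid > h:              # directory miss: wait for the next MBC
        probes.append(0)             # (simplified: restart at bucket 0)
    probes.append(h)                 # first probe at logical bucket h(k)
    if SHIFT[h] > 0:                 # follow the shift pointer, then scan
        probes.append(h + SHIFT[h])  # final probe at the true logical bucket
    return probes

# Search key 18, initial probe at BID 1 (the example walked through above)
print(hashing_a_probes(18, initial_bid=1))  # -> [1, 3, 10]
```

The client then scans from the final probe position onward, here over bucket 10 and its overflow bucket 11, until an object hashing to a different value is observed.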

In Figure 3.3 we considered the case where the client tunes into the broadcast channel (to locate data objects with search key k) at the physical bucket Bcurr whose associated bucket identifier is smaller than h(k), i.e., guiding or directory information to direct the user to the required data objects can still be obtained from the current broadcast cycle. However, if that is not the case, i.e.,

the client makes its initial probe after the bucket storing either the desired data objects themselves or, alternatively, if overflow exists, a pointer to the first of a chain of buckets keeping them, the client needs to wait until the next MBC to find out about their presence in the broadcast program.

[Figure 3.3: An example illustrating the Hashing A data access protocol by using h(k) = k mod 5 as hash function. The broadcast consists of 15 physical buckets whose headers carry (BID, oMBC, h(k), s); the logical buckets with BIDs 0−4 carry the shift values +0, +2, +6, +7, and +8 and are interleaved with their overflow buckets. The three rows of the figure trace the initial, first, and final probe phases.]

Obviously, in the overflow case, missing the bucket containing a pointer to the true logical bucket may become costly as it requires the user to wait until the next MBC to locate the desired data objects. Its cost is proportional to the ratio of the number of overflow buckets within the broadcast program to the number of their corresponding logical buckets and can be reduced by using the

Hashing B data access protocol [76], which refines the Hashing A scheme by modifying the hash function h(k) such that it takes into account the minimum size of the overflow chain of the logical buckets. By doing so, the probability of so-called directory misses can be reduced, and thus, the average ADAT of the Hashing B protocol may be significantly lower than that of the Hashing

A scheme [76]. For more information on the Hashing B protocol and a comparison study of both schemes, we refer the interested reader to [76].
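The probe sequence of the example above can be sketched in a few lines. The hash function and the shift-offset table mirror Figure 3.3; the function name and the dictionary encoding of the per-bucket shift offsets are hypothetical illustrations, not the bucket format used in [76].

```python
# Sketch of the Hashing A probe positions (encoding is an assumption).
def hashing_a_probes(h, shift, k, num_logical):
    """Probe positions a Hashing A client visits for search key k.

    h           -- hash function mapping a key to a logical bucket id
    shift       -- shift[b]: overflow displacement stored in bucket b's header
    num_logical -- number of logical (non-overflow) buckets
    """
    first = h(k) % num_logical       # first probe: the hashed bucket
    final = first + shift[first]     # follow the stored shift to the data
    return [first, final]

# The running example: h(k) = k mod 5; logical bucket 3 is shifted by 7.
shift = {0: 0, 1: 2, 2: 6, 3: 7, 4: 8}
print(hashing_a_probes(lambda k: k % 5, shift, 18, 5))   # [3, 10]
```

For key 18 this reproduces the probe sequence of the walkthrough: first probe at BID 3, final probe at BID 10 (scanning the overflow chain afterwards is omitted here).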

Tree-based Indexing

Last, but not least, various tree-based indexing schemes have been proposed in the literature to address the power conservation issue of broadcast channels [35, 75, 76, 77, 142, 167]. Among the proposed tree-based techniques, the (1,m) indexing method [75, 77] is one of the most prominent representatives of this category and is therefore briefly described in what follows. Like many other tree-based indexing schemes, the (1,m) indexing method applies a B+-tree for air indexing and broadcasts the whole B+-tree m times during the transmission of a single instance of the data

file. The index is broadcast at the beginning of every (1/m)-th fraction of the data file. Figure 3.4(a) exemplifies how Imielinski et al. adapt the B+-tree indexing technique for air indexing by storing the arrival times of the data objects in the leaf nodes of the tree. The figure shows a B+-tree of order 1 and height 2 which indexes a data file (consisting of data objects stored in 18 buckets) along a single attribute. In the figure, the rectangular boxes in the bottommost level depict the data file and each box represents a collection of 2 data buckets. The B+-tree is shown above the data buckets and each leaf node bucket has 2 pointers to its associated data buckets. Obviously, the entries of the leaf node buckets are (key value, data bucket pointer) pairs, while non-leaf node buckets contain (key value, index bucket pointer) pairs.

Besides constructing the B+-tree³ of the data objects to be broadcast, we need a simple way of mapping the index tree to the channel-time space. We do so by traversing the B+-tree in a top-down, left-to-right fashion and map each index bucket to the broadcast channel-time space in the order of its selection. In Figure 3.4(b) we see an example of how to map the index buckets of the

B+-tree shown in Figure 3.4(a) onto the channel-time space. For reasons of clarity, we represent the index buckets in Figure 3.4(b) using alphanumeric identifiers rather than the values of the keys as in

Figure 3.4(a). Figure 3.4(b) also illustrates how the user is guided through the index when searching for a required data object. The example assumes that a data object with key 18 is to be accessed by the user. The index traversal starts once the user has been routed to the root bucket R of the index tree. The user probes R and is directed to index bucket I2 by the search key comparison and then to the leaf bucket L5. At each index probe, the user obtains the time offset at which the next required child index node is transmitted, enabling it to switch into doze mode between consecutive index probes.
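The top-down, left-to-right mapping amounts to a pre-order traversal of the index tree. The sketch below reproduces the channel order of Figure 3.4(b); the adjacency-dictionary encoding of the tree is an assumption for illustration, with node names taken from the figure.

```python
# Sketch: mapping the index tree to the broadcast channel-time space
# by a pre-order (top-down, left-to-right) traversal.
def broadcast_order(tree, root):
    """Pre-order bucket sequence for the broadcast channel."""
    order = [root]
    for child in tree.get(root, []):       # leaves need no dictionary entry
        order += broadcast_order(tree, child)
    return order

tree = {
    "R":  ["I1", "I2", "I3"],
    "I1": ["L1", "L2", "L3"],
    "I2": ["L4", "L5", "L6"],
    "I3": ["L7", "L8", "L9"],
}
print(broadcast_order(tree, "R"))
# ['R', 'I1', 'L1', 'L2', 'L3', 'I2', 'L4', 'L5', 'L6', 'I3', 'L7', 'L8', 'L9']
```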

When building the B+-tree, the construction algorithm will need to know for each index node when and where (in the multi-channel channel set-up) the data or index buckets it refers to are to

³B+-trees can either be constructed by repeatedly applying the B+-tree insertion algorithm [22, 93] or, if the tree needs to be built from scratch, by using the more efficient batch-construction algorithm [91].

[Figure 3.4(a): a B+-tree with root R (keys 15, 23) at level 2; index buckets I1 (6, 11), I2 (18, 21), and I3 (28, 32) at level 1; leaf buckets L1-L9 at level 0 holding the key pairs (1, 4), (6, 8), (11, 12), (15, 16), (18, 19), (21, 22), (23, 26), (28, 31), (32, 34) and pointing to data buckets 0,1 through 16,17. Figure 3.4(b): the resulting broadcast order R I1 L1 L2 L3 I2 L4 L5 L6 I3 L7 L8 L9.]

Figure 3.4: Tree-indexed broadcasting: (a) an example B+-tree and (b) index probing scenario to the data object with the key 18.

be transmitted so that it may store logical pointers to them. This so-called pointer filling can be performed as follows:

Let $n_i$ denote the number of index buckets at the $i$-th level, $level(i)$, of the B+-tree, $0 \le level(i) \le height$, and let $j$, $j > 0$, denote the $j$-th index bucket at the $i$-th index level in left-to-right order. Additionally, let $height$ be the height of the B+-tree and let $n_{index}$ denote the number of index buckets required to store the entire B+-tree. The index bucket at position $p(i, j)$ will then periodically be broadcast at time:

$$T_{i,j}^{index} = T_s + (k + 1) \cdot \sum_{l > level(i)}^{height} n_l + j + k \cdot (n_{index} + n_{non\text{-}index}), \quad k \in \mathbb{N}, \qquad (3.2)$$

where $T_s$ represents the (logical) time at which the server begins broadcasting data (initially, $T_s$ is 0) and $n_{non\text{-}index}$ denotes the number of non-index buckets disseminated between two consecutive index files. Note that in order to keep Equation 3.2 as simple as possible, the (1,m) index is assumed to be broadcast first within a MIBC.

Let $m$ denote the number of times the whole B+-tree is broadcast during a MBC, let $n_{CCR}$ denote the number of CCR buckets reserved for CC-related information per MIBC, and let $n_{data}$ denote the number of data buckets required to store a single instance of the data file, which is assumed to be a multiple of $m$. Let $n_{bcast}$ further denote the length of a MBC in terms of buckets. Then, the periodic time when the $i$-th data bucket, $0 \le i < n_{data}$, is broadcast can be computed by using the following formula:

$$T_i^{data} = T_s + (k + 1) \cdot \left\lfloor \frac{i \cdot m}{n_{data}} \right\rfloor \cdot \left( n_{index} + n_{CCR} + \frac{n_{data}}{m} \right) + n_{index} + n_{CCR} + \left( i \bmod \frac{n_{data}}{m} \right) + 1 + k \cdot n_{bcast}, \quad k \in \mathbb{N}. \qquad (3.3)$$
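Independently of the closed-form expressions, bucket air times can also be obtained by laying out one major cycle explicitly. The sketch below assumes each MIBC transmits the full index segment first, then the CCR buckets, then its 1/m share of the data file; the segment ordering and the concrete bucket counts are assumptions for illustration.

```python
# Sketch: explicit (1,m) broadcast layout for looking up bucket air times.
# Segment order per MIBC (index, then CCR, then data) is an assumption.
def mibc_layout(m, n_index, n_ccr, n_data):
    """Return one major broadcast cycle as a list of (kind, id) slots."""
    per_mibc = n_data // m                 # data buckets per minor cycle
    cycle = []
    for mibc in range(m):
        cycle += [("index", i) for i in range(n_index)]
        cycle += [("ccr", i) for i in range(n_ccr)]
        cycle += [("data", mibc * per_mibc + i) for i in range(per_mibc)]
    return cycle

def first_air_time(cycle, kind, ident, t_start=0):
    """Logical time at which a given bucket is first broadcast."""
    return t_start + cycle.index((kind, ident))

cycle = mibc_layout(m=3, n_index=4, n_ccr=1, n_data=18)
print(len(cycle))                          # 33 buckets per major cycle
print(first_air_time(cycle, "data", 0))    # 5: after index (4) and CCR (1)
print(first_air_time(cycle, "data", 6))    # 16: first data slot of MIBC 2
```

Subsequent occurrences of a bucket follow at multiples of the major cycle length, i.e., at offsets of k times len(cycle).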

To complete the description of the (1,m) indexing algorithm, we finally specify the data access algorithm the mobile user uses to reach the desired data object(s) (see Algorithm 3.4). As before, the algorithm is executed as soon as the requested data object(s) are chosen and they are identified as non-cache-resident at the client.

1  begin
     /* Initial probe */
2    Tune into the broadcast channel, read the header part of the current bucket Bcurr and find out when the next index tree bucket will be broadcast.
3    Go into doze mode and wake up when the first bucket (root node bucket) of the next index segment is broadcast.
     /* Index probe */
4    Traverse the B+-tree by successively probing non-leaf nodes until the leaf node in which the search key belongs is found.
5    Power down the mobile device between successive index node probes to conserve energy.
6    if search key k matches any leaf node entry of the tree then
       /* Air-cache hit */
7      Tune into the broadcast channel when the first data bucket containing data objects with attribute value k is disseminated and download all data objects with matching attribute value k.
8    else
       /* Air-cache miss */
9      Request the required data objects with attribute value k directly from the server via point-to-point communication.
10 end

Algorithm 3.4: Access protocol for retrieving data objects by using the (1,m) indexing scheme.
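The index-probe phase of Algorithm 3.4 amounts to a key-guided descent through the tree. The sketch below replays it on the tree of Figure 3.4(a); the node encoding, function name, and separator-key routing rule are assumptions for illustration, and doze-mode handling is omitted.

```python
# Sketch of the index-probe phase of Algorithm 3.4 on the tree of Fig. 3.4(a).
# (name -> (children left to right, separator keys)); encoding is an assumption.
INDEX = {
    "R":  (["I1", "I2", "I3"], [15, 23]),
    "I1": (["L1", "L2", "L3"], [6, 11]),
    "I2": (["L4", "L5", "L6"], [18, 21]),
    "I3": (["L7", "L8", "L9"], [28, 32]),
}
LEAVES = {  # leaf bucket -> keys it indexes
    "L1": [1, 4], "L2": [6, 8], "L3": [11, 12],
    "L4": [15, 16], "L5": [18, 19], "L6": [21, 22],
    "L7": [23, 26], "L8": [28, 31], "L9": [32, 34],
}

def probe_index(key):
    """Return (probed index buckets, air-cache hit?) for a search key."""
    probes, node = [], "R"
    while node in INDEX:
        probes.append(node)
        children, seps = INDEX[node]
        node = children[sum(1 for s in seps if key >= s)]  # key comparison
    probes.append(node)
    return probes, key in LEAVES[node]

print(probe_index(18))   # (['R', 'I2', 'L5'], True) -- matches the example
```

A hit leads to line 7 of Algorithm 3.4 (download from the air-cache); a miss leads to line 9 (point-to-point request).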

Besides the (1,m) indexing technique, there exists in the literature a plethora of other tree-based indexing methods, some of which are summarized hereafter. The distributed indexing method [75, 77] has been proposed to cut down the high number of index buckets replicated by (1,m) indexing within a broadcast cycle. To do so, the index tree is divided into a replicated and a non-replicated part, with the latter being broadcast only once per MBC. This is possible as the non-replicated index part always indexes only the data objects immediately following the distributed index dissemination; hence, distributed indexing is able to reduce the relatively high replication and access latency costs of the (1,m) indexing technique, while still achieving good tuning time.

Both (1,m) indexing and distributed indexing assume that client data accesses are uniformly distributed. In practice, this is hardly the case and therefore, unbalanced tree structures that optimize the tuning time for non-uniform data accesses have been suggested [35, 142]. More precisely, k-ary Alphabetic Huffman trees were proposed that minimize the average index search cost by reducing the number of index node probes for data objects with high access probabilities at the expense of spending more on those with low popularity. To allow system designers to trade off between access latency and tuning time based on the respective application-specific requirements, the flexible and the exponential indexing methods have been proposed [76, 167]. Both methods achieve this goal by providing the system designer with tuning parameters that offer great flexibility in trading access time against tuning time and vice versa.

An important question that has not been addressed so far is how the previously discussed indexing methods perform against each other in terms of ADAT and ATT. Unfortunately, and to the best of our knowledge, there is no study in the literature comparing all three indexing classes. We are therefore restricted in our comparative evaluation to the performance results gathered by two independently conducted performance studies which compare the signature-based and tree-based indexing methods [69], and the hashing-based and tree-based indexing methods [76, 155] in terms of ADAT and ATT. In what follows, we present a brief synopsis of the experimental results of both studies, starting with key observations drawn from comparing the Hashing B with the flexible indexing method [76, 155]:

• Both the tree-based indexing and hash-based indexing techniques have some advantages over the other.

• Tree-based indexing techniques should be used if energy conservation is the main application requirement and the key size is relatively small compared to the size of the data objects.

• Hash-based indexing techniques should be used if energy efficiency is of minor importance for the application scenario and the key size is relatively large compared to the size of the data objects.

The comparison study of the simple signature and distributed indexing methods [69] provided the following results:

• Both the tree-based indexing and signature-based indexing techniques have some advantages over the other.

• If the access time is more important for the usage scenario than the tuning time, use signature-based indexing.

• If the tuning time, i.e., the power conservation, is more important for the envisaged application scenario than the access time, use tree-based indexing.

As a result of those two comparison studies and due to the fact that the optimization of the access latency and tuning time metrics are contradictory goals, we can conclude that none of the three indexing methods is superior to any of the others in terms of both access and tuning time. Additionally, the results show that if tuning time is the most important system design factor, tree-based indexing techniques should be used. However, to achieve good access times, either the signature-based or the hashing-based indexing method should be deployed.

"Ten years from now, all transactions may be done wirelessly."

– Jeff Bezos, CEO, Amazon.com

Chapter 4

Processing Read-only Transactions Efficiently and

Correctly

4.1 Introduction

Consider innovative applications such as road traffic information and navigation systems, online auction and stock monitoring systems, news/sport/weather tickers, etc., that may employ broadcast technology to deliver data to a huge number of clients. As those applications primarily consume data rather than produce it, the majority of them that need data consistency and currency guarantees initiate read-only transactions. Running such read-only transactions efficiently despite the various limitations of a wireless broadcasting environment is addressed in this chapter. How transaction processing can be implemented, and how data consistency and currency can be guaranteed, is constrained, among others, by the limited communication bandwidth of mobile networks. Today's wireless network technology such as cellular or satellite networks offers a client-to-server bandwidth that is still severely restricted (see Table 2.1 for detailed information on the bandwidth characteristics of various mobile networks). Fortunately, the server-to-client bandwidth is often much higher than in the opposite direction, which makes the broadcasting paradigm an attractive choice for data delivery and ensures, as shown in this chapter, that read-only transaction processing algorithms can be implemented efficiently.

4.1.1 Motivation

Irrespective of the environment (central or distributed, wireless or stationary) in which read-only transactions are processed, they have the potential of being managed more efficiently than their read-write counterparts, especially if special concurrency control (CC) protocols are applied. Multi-version CC schemes [109, 159, 162] appear to be ideal candidates for read-only transaction processing in broadcasting environments since they allow read-only transactions to execute without any interference with concurrent read-write transactions. If multiple object versions are kept in the database system, read-only transactions can read older object versions and, thus, never need to wait for a read-write transaction to commit or to abort in order to resolve a conflict. As with read-write transactions, read-only transactions may be executed with various degrees of consistency. Choosing lower levels of consistency than serializability for transaction management is attractive for two reasons: (a) the set of correct multi-version histories that can be produced by a scheduler is increased and, hence, higher performance (i.e., transaction throughput) can be achieved; (b) weaker consistency levels may allow read-only transactions to read more recent object versions. Thus, weaker consistency levels trade consistency for throughput performance and data currency.

While reading current (or at least "close" to current) data is necessary for read-write transactions to preserve database consistency during updates, such requirements are not necessary for read-only transactions to be scheduled in a serializable way. That is, read-only transactions can be executed with serializability correctness even though they observe out-of-date database snapshots. To prevent read-only transactions from seeing database states that are too old, thus causing users to experience transaction anomalies related to data freshness, we need well-defined isolation levels (ILs) which guarantee both data consistency and data currency to read-only transactions. The ANSI/ISO SQL-92 specification [14] defines four ILs, namely Read Uncommitted, Read Committed, Repeatable Read, and Serializability. Those levels do not incorporate any currency guarantees, though, and are thus unsuitable for managing read-only transactions in distributed mobile database environments.

Theory and practice have pointed out the inadequacy and imprecise definition of the SQL ILs [23], and some redefinitions have been proposed in [10]. Additionally, a range of new ILs were proposed that lie between the Read Committed and Serializability levels. The new intermediate ILs were designed for the needs of read-write transactions, with only three of them explicitly stating the notion of logical time. One of those levels, called Snapshot Isolation (SI) [23], ensures data currency to both read-only and read-write transactions by forcing them to read from a data snapshot that existed by the time the transaction started. Oracle's Read Consistency (RC) level [117] provides stronger currency guarantees than Snapshot Isolation by guaranteeing that each SQL statement in a transaction Ti sees the database state at least as recent as it existed by the time Ti issued its first read operation. For subsequent read operations/SQL statements, RC ensures that they observe a database state that is at least as recent as the snapshot seen by the previous read operation/SQL statement. Finally, Adya [9] defines an IL named Forward Consistent View (FCV) that extends SI by allowing a read-only (read-write) transaction Ti (Tj) to read object versions created by read-write transactions after Ti's (Tj's) starting point, as long as those reads are consistent in the sense that Ti (Tj) sees the (complete) effects of all update transactions it write-read (or write-read/write-write) depends on.

The above-mentioned levels are not ideally suitable for processing read-only transactions for the following reasons: (a) All of them are weaker consistency levels, i.e., read-write transactions executed at any of these levels may violate consistency of the database since none of them requires the strictness of serializability. Consequently, read-only transactions may observe an inconsistent database state if they view the effects of transactions that have modified the database in an inconsistent manner. Inconsistent or bounded consistent reads may not be acceptable for some mobile applications, thus making non-serializability levels that do not ensure database consistency to such transactions inappropriate. (b) Another problem arises from the fact that mobile database applications may need various data currency guarantees depending on the type of application and actual user requirements. The ILs mentioned above provide only a limited variety of data currency guarantees to read-only transactions. All levels ensure that read-only transactions read from a database state that existed at a time not later than the transaction's starting point. Such firm currency guarantees may be too restrictive for some mobile applications. Hence, there is a need for the definition of new ILs that incorporate weaker currency guarantees. Moreover, we need to define new ILs that meet the specific requirements of (mobile) read-only transactions.

4.1.2 Contribution and Outline

This chapter's contributions are as follows: (a) We define four new ILs that provide useful consistency and currency guarantees to mobile read-only transactions. In contrast to the ANSI/ISO SQL-92 ILs [14] and their modifications by [23], our definitions are not stated in terms of existing concurrency control mechanisms, including locking, timestamp ordering, and optimistic schemes, but are rather independent of such protocols in their specification. (b) We design a suite of multi-version concurrency control algorithms that efficiently implement the proposed ILs. (c) Finally, we present the performance results of our protocols and compare them. To our knowledge, this is the first simulation study that validates the performance of concurrency control protocols providing various levels of consistency and currency to read-only transactions in a mobile hybrid data delivery environment.

The remainder of the chapter is organized as follows: In Section 4.2, we introduce some notations and terminology used throughout this chapter. In Section 4.3, we define new ILs especially suitable for mobile read-only transactions by combining both data consistency and currency guarantees. Implementation issues are discussed in Section 4.4. Section 4.5 reports on the results of an extensive simulation study conducted to evaluate the performance of the implementations of the newly defined ILs. Section 4.6 concludes this chapter by presenting the main research results of this work.

4.2 Preliminaries

Before proposing new ILs providing strong semantic data consistency and data currency guarantees to read-only transactions, it is necessary to provide a formal framework upon which their definitions are based. We do so by establishing our definitions of the important basic notions of database, database state, and transaction:

Definition 7 (Database). A database D = {x1, x2, ..., xn} is a finite set of uniquely identified data objects that can be operated on by a single atomic database operation, where each data object xi ∈ D has a finite domain dom(xi). □

Definition 8 (Database State). A database state DS is an element of the Cartesian product of the domains of the elements of D, i.e., a state associates a value with each data object in the database. More formally, DS ∈ dom(x1) × dom(x2) × ... × dom(xn). □

Transactions are submitted against the database by multiple, concurrent users and it is assumed that all transaction programs are correctly formed and execute on a consistent database state and, in the absence of other transactions, will always produce a new consistent database state. In what follows, we use ri[x,v] (or wi[x,v]) to denote that transaction Ti issued a read (or write) operation on data object x and the value read from (or written into) x by Ti is v. To keep the notations simple, we assume that no transaction Ti reads and writes any object x more than once during its lifetime.

Besides, we use bi, ci, and ai to denote Ti’s transaction management primitives begin, commit, and abort, respectively, and the set of all read and write operations issued by transaction Ti is denoted by OPi. We are now ready to formally define the notion of a transaction:

Definition 9 (Transaction). A transaction is a partial order Ti = (∑i, <i), where:

1. ∑i ⊆ OPi ∪ {bi, ci, ai};

2. ai ∈ ∑i iff ci ∉ ∑i;

3. let p denote either ai or ci; for any other operation q ∈ ∑i, q <i p;

4. let p represent the transactional primitive bi; for any other operation q ∈ ∑i, p <i q;

5. if ri[x,v], wi[x,v] ∈ OPi, then either ri[x,v] <i wi[x,v] or wi[x,v] <i ri[x,v]. □

Condition (1) enumerates the operations performed by a transaction. Statement (2) states that a transaction contains either an abort or a commit operation, but not both, and Point (3) guarantees that all transaction operations occur before Ti's termination. Condition (4) says that all transaction operations are preceded by a begin operation, and Point (5) finally ensures that read and write operations on a common data object are ordered according to the ordering relation <i.
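For a transaction given as a totally ordered list of events (one admissible special case of the partial order of Definition 9), conditions (2)-(4) can be checked mechanically. The event encoding below is an assumption for illustration; conditions (1) and (5) hold trivially for such an encoding.

```python
# Sketch: checking Definition 9 on a totally ordered event list (assumed encoding).
def is_well_formed(events):
    """events: e.g. ['b1', 'r1[x]', 'w1[y]', 'c1'] for transaction T1."""
    if not events or events[0][0] != "b":       # (4) begin precedes everything
        return False
    terminators = [e for e in events if e[0] in ("a", "c")]
    return (len(terminators) == 1               # (2) abort xor commit
            and events[-1][0] in ("a", "c"))    # (3) termination comes last

print(is_well_formed(["b1", "r1[x]", "w1[y]", "c1"]))  # True
print(is_well_formed(["r1[x]", "c1"]))                 # False: missing begin
print(is_well_formed(["b1", "c1", "r1[x]"]))           # False: read after commit
```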

Definition 10 (Read-only and Read-Write Transaction). A transaction is a read-only transaction if it contains no write operations, and is a read-write transaction otherwise. □

As noted above, in practice, multiple transactions are likely to be executed in parallel against the database, i.e., operations of different transactions may be interleaved. To record the relative execution order of those operations, the transaction scheduler keeps an internal structure called a history. Informally, a history is a partial order of the executions of transaction operations where the operation ordering within transactions is preserved and all pairs of operations that conflict¹ are ordered. Histories can be classified into two types: (a) single-version or mono-version and (b) multi-version histories. A single-version history captures what "happens" in the execution of a single-version or multi-version database system and is a special case of a multi-version history since the scheduler or, more precisely, a version function f maps each read operation rj[x,v] on an object x to the most recent write operation wi[x,v] that precedes it in the history, i.e., wi[x,v] < rj[x,v] and if wk[x,v] also occurs in the history (i ≠ k), then either wk[x,v] < wi[x,v] or rj[x,v] < wk[x,v]. A multi-version history relaxes this restriction by allowing read requests to be mapped to appropriate, but not necessarily up-to-date, versions of data. This flexibility gives the scheduler the potential to produce more correct histories (since read operations that arrive too late need no longer be rejected), which, in turn, may improve the degree of concurrency in the system. A prerequisite to exploiting the performance benefits of multi-versioning is, of course, to maintain multiple versions of objects in the system. If multiple object versions are allowed to be kept, each write operation on an object x by transaction Ti produces a new version of it, which we denote by xi, where the version subscript represents the index of the transaction Ti that wrote the version. Thus, each write operation by transaction Ti in a multi-version history is always of the form wi[xi]. If the value v is written into xi by Ti, we use the notation wi[xi,v]. To indicate that transaction Ti has read a version of object x that was installed by transaction Tj, we denote this as ri[xj]. In case value v of object version xj has been read by Ti, we represent this by ri[xj,v]. After these notational preliminaries, we are now in the position to give the formal definition of a multi-version history:

¹Two operations p[x,v] and q[x,v] are said to conflict if one of them is a write operation.

Definition 11 (Multi-Version History). A multi-version history MVH over a set of transactions T = {T0, T1, ..., Tn} consists of two parts — a partial order (∑T, <T) with:

1. ∑T = f(∑0 ∪ ∑1 ∪ ... ∪ ∑n) for some multi-version function f;

2. for each Ti ∈ T and all operations p, q ∈ Ti, if p <i q, then f(p) <T f(q);

3. if f(rj[x,v]) = rj[xi,v], then wi[xi,v] <T rj[xi,v];

4. if f(rj[x,v]) = rj[xi,v], i ≠ j, and cj ∈ MVH, then ci <T cj;

and a version order, ≪, i.e., a total order on the committed object versions in MVH which may be different from the relative ordering of write or commit operations in MVH. □


Point (1) says that the version function f maps all operations of all transactions in the history into appropriate multi-version operations, and Condition (2) states that the mapping respects the individual transaction orderings. Statement (3) specifies that a transaction may not read an object version before it has been created. To ensure that Statement (3) always holds, we assume the existence of an initialization transaction T0 that creates a so-called zero version of each data object stored in the database. Point (4) finally states that a transaction may only commit if all other transactions that created object versions it read are themselves committed.
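The version function f in its single-version special case (every read observes the most recent preceding write on the same object) can be sketched as follows; the tuple encoding of operations is an assumption for illustration.

```python
# Sketch of the mono-version version function (assumed operation encoding).
def mono_version_function(history):
    """history: operations as ('w', txn, obj) or ('r', txn, obj) tuples.

    Returns {read position: writer transaction of the version observed},
    i.e., each read is mapped to the most recent preceding write on obj.
    """
    last_writer, reads = {}, {}
    for pos, (kind, txn, obj) in enumerate(history):
        if kind == "w":
            last_writer[obj] = txn       # installs version obj_txn
        else:
            reads[pos] = last_writer[obj]
    return reads

h = [("w", 0, "x"), ("w", 1, "x"), ("r", 2, "x"), ("w", 3, "x"), ("r", 4, "x")]
print(mono_version_function(h))   # {2: 1, 4: 3}
```

A multi-version scheduler relaxes exactly the `last_writer` lookup: a read may be mapped to any appropriate older version, not only the most recent one.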

For notational convenience, we place some additional constraints on the definition of a multi-version history given above. In what follows, we require that the version order of an object x in a multi-version history MVH corresponds to the temporal order in which write operations on x occur in MVH, i.e., whenever write operation wi[xi,v] immediately precedes write operation wj[xj,v] in MVH, then xi ≪ xj. Additionally, we require that the version order in MVH cannot be different from the order of commit events in MVH, i.e., if the version order of object x is xi ≪ xj, then ci <T cj; likewise, if any other committed version xk of x precedes xj in the version order (i.e., xk ≪ xj), then ck <T cj.

Definition 12 (Transaction Projection). Let MVH denote a multi-version history, Ti a transaction occurring partially or completely in MVH, and OP(Ti) its finite set of transaction operations. A transaction projection of a multi-version history MVH onto Ti, denoted P(MVH, Ti), is a subhistory MVH′ containing the transaction operations OP(MVH′) = OP(Ti), i.e., MVH′ includes only the operations issued by Ti. □

In certain cases the projection onto transactions that committed within a certain logical time interval is required in order to reason about scheduling correctness. This gives rise to the following definition:

Definition 13 (Interval Committed Projection). Let MVH denote a multi-version history, and let again p and q represent two distinct events in MVH, i.e., p, q ∈ ∑T, with p <T q. An interval committed projection of MVH with respect to p and q is the subhistory of MVH that contains exactly the operations of those transactions that committed after p and no later than q. □

It is important to note that both projections preserve the relative order of the original operations.

To validate the correctness of multi-version histories w.r.t. ILs defined in Section 4.3, we need to formalize possible direct and indirect data dependencies between transactions:

Definition 14 (Direct Write-Read Dependency). A direct write-read dependency (Tj δ^wr Ti) between transactions Ti and Tj exists if there is a write operation wj which precedes a read operation ri in MVH according to <T such that ri reads the object version created by wj. We denote write-read dependencies by δ^wr or wr. □

Definition 15 (Direct Write-Write Dependency). A direct write-write dependency (Tj δ^ww Ti) between transactions Ti and Tj exists if there exists a write operation wj which precedes a write operation wi on the same object in MVH according to <T. We denote write-write dependencies by δ^ww or ww. □

Definition 16 (Direct Read-Write Dependency). A direct read-write dependency (Tj δ^rw Ti) occurs between two transactions Ti and Tj if there is a read operation rj and a write operation wi on the same object in MVH in the order rj <T wi. We denote read-write dependencies by δ^rw or rw. □

If the type of dependency between two distinct transactions does not matter, we say that they are in an arbitrary dependency:

Definition 17 (Arbitrary Direct Dependency). Two transactions Ti and Tj are in an arbitrary direct dependency in MVH if there exists a direct rw-, ww-, or wr-dependency between Ti and Tj. We denote arbitrary direct dependencies by δ^? or ?. □

Definition 18 (Arbitrary Indirect Dependency). Two transactions Ti and Tj are in an arbitrary indirect dependency in MVH if there exists a sequence ⟨Tj δ^? Tk1 δ^? Tk2 ... δ^? Tkn δ^? Ti⟩ (n ≥ 1) in MVH. We denote arbitrary indirect dependencies by δ^* or *. □

We are now in the position to define the direct multi-version serialization graph of a multi- version history MVH, called MVSG(MVH), which in contrast to the Serialization Graph definition of [26] contains labeled edges indicating which dependencies occur between the transactions.

Definition 19 (Direct Multi-Version Serialization Graph). A direct multi-version serialization graph MVSG(MVH) is defined on a multi-version history MVH, with nodes representing transactions that successfully terminated, and each labeled edge from transaction Ti to Tj corresponds to a direct read-write, write-write, or write-read dependency, i.e., there is a rw-, ww-, or wr-dependency edge from transaction Ti to transaction Tj if and only if Tj directly rw-, ww-, or wr-depends on Ti. □

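The direct dependencies of Definitions 14-16 can be extracted mechanically from a versioned history. The sketch below is a simplification under stated assumptions: operations are encoded as tuples in <T order, the ww-rule uses history order (admissible here because the version order is required to agree with the write order), and the rw-rule compares positions in a supplied version order.

```python
# Sketch: extracting labeled dependency edges (Defs. 14-16, simplified).
def dependency_edges(history, version_order):
    """history: ('r'|'w', txn, obj, version) tuples in <_T order.
    version_order: obj -> list of its versions, oldest first.
    Returns labeled edges (T_i, T_j, label), direction as in Definition 19."""
    edges = set()
    for i, (k1, t1, o1, v1) in enumerate(history):
        for k2, t2, o2, v2 in history[i + 1:]:
            if o1 != o2 or t1 == t2:
                continue
            if k1 == "w" and k2 == "r" and v1 == v2:
                edges.add((t1, t2, "wr"))     # t2 reads t1's version
            elif k1 == "w" and k2 == "w":
                edges.add((t1, t2, "ww"))     # t2 installs a later version
            elif (k1 == "r" and k2 == "w"
                  and version_order[o1].index(v1) < version_order[o1].index(v2)):
                edges.add((t1, t2, "rw"))     # t2 overwrites what t1 read
    return edges

hist = [("w", "T0", "x", "x0"), ("r", "T1", "x", "x0"), ("w", "T2", "x", "x2")]
vo = {"x": ["x0", "x2"]}
print(sorted(dependency_edges(hist, vo)))
# [('T0', 'T1', 'wr'), ('T0', 'T2', 'ww'), ('T1', 'T2', 'rw')]
```

The resulting edge set, together with the transaction nodes, is exactly the labeled MVSG of Definition 19 for this toy history.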

4.3 New Isolation Levels Suitable for Read-only Transactions

4.3.1 Why Serializability may be Insufficient

Serializability is the standard criterion for transaction processing in both stationary and mobile computing. Its importance and popularity are related to the fact that it prevents read-write transactions from violating database consistency by assuring that they always transform the database from one consistent state into another. With respect to read-only transactions, serializability as defined in [26] guarantees that all read-only transactions perceive the same serial order of read-write transactions. Additionally, serializability requires that read-only transactions serialize with each other.

However, the serializability criterion in itself is not sufficient for preventing read-only transactions from experiencing anomalies related to data currency as the following example shows:

Example 3.

MVH1 = b0 w0[x0, 2:40 pm] b1 r1[z0, cloudy] w0[y0, 2:50 pm] c0 w1[z1, blizzard] c1 b2 r2[z1, blizzard] r2[x0, 2:40 pm] w2[x2, 2:50 pm] c2 b3 r3[x0, 2:40 pm] b4 r4[x2, 2:50 pm] b5 r5[z1, blizzard] r5[y0, 2:50 pm] r3[y0, 2:50 pm] c3 w5[y5, 3:00 pm] c5 r4[y5, 3:00 pm] c4    [x0 ≪ x2, y0 ≪ y5, z0 ≪ z1]

History MVH1 might be produced by a flight scheduling system supporting multiple object versions, which is the rule rather than the exception in mobile distributed database systems. In MVH1, transaction T0 is a blind write transaction that sets the take-off times of flights x and y, respectively, and T1 is an event-driven transaction initiated automatically by the airport weather station to indicate an imminent weather change. Due to the inclement weather forecast, the Air Traffic Control Center instantly delays both scheduled flights by 10 minutes through transactions T2 and T5. At the same time, two co-located members of the ground staff equipped with PDAs query the airport flight scheduling system in response to passengers' requests to check the actual take-off times of flights x and y (T3 and T4). While one of the employees (who invokes transaction T3) may locate the required data in his local cache, the other (who invokes transaction T4) may have to connect to the central database in order to satisfy his data requirements. As a consequence, both persons read from different database snapshots without serializability guarantees being violated, which can easily be verified by sketching the multi-version serialization graph (MVSG) of MVH1. □

Figure 4.1: Multi-version serialization graph of MVH1.

As the above example clearly illustrates, serializability by itself may not be a sufficient requirement for avoiding phenomena related to reading from old database snapshots. This shortcoming is addressed in the following subsections.
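The acyclicity test behind Figure 4.1 can be mechanized. The following sketch is our own illustration, not part of the thesis: it encodes an MVSG as an adjacency map and checks it for cycles with a depth-first search; the edge set reflects our reading of the direct dependencies in MVH1 from Example 3.

```python
def has_cycle(edges):
    """Detect a cycle in a digraph given as {node: set(successors)}."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in edges}

    def dfs(n):
        color[n] = GRAY
        for m in edges.get(n, ()):
            if color.get(m, WHITE) == GRAY:      # back edge found -> cycle
                return True
            if color.get(m, WHITE) == WHITE and dfs(m):
                return True
        color[n] = BLACK
        return False

    return any(color[n] == WHITE and dfs(n) for n in list(edges))

# Direct wr/ww/rw dependencies of MVH1 as we read them off the history.
mvsg_mvh1 = {
    "T0": {"T2", "T3", "T5"},   # wr/ww edges from T0's initial writes
    "T1": {"T2", "T5"},         # wr edges via z1
    "T2": {"T4"},               # wr edge via x2
    "T3": {"T2", "T5"},         # rw edges: T3 read x0 and y0, later overwritten
    "T5": {"T4"},               # wr edge via y5
    "T4": set(),
}

assert not has_cycle(mvsg_mvh1)
```

The graph is acyclic, so MVH1 is serializable; yet any topological order places T3 before T2 and T5 while T4 comes after them, which is exactly the two-snapshot effect discussed above.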

4.3.2 BOT Serializability

Influenced by the findings of the previous example, we now define two new ILs that combine the strictness of serializability with firm data currency guarantees (see below for the definition of this notion). Unlike the ANSI definition of serializability, our definition ensures well-defined data currency to read-only transactions. The existing ANSI specification of serializability and its redefi- nition by [10] do not contain any data currency guarantees for read-only transactions. Under those levels, read-only transactions are allowed to be executed without any restrictions w.r.t. the currency of the observed data. We will define our ILs in terms of histories. We associate a directed graph with each newly defined isolation level ILi. A multi-version history MVH provides ILi guarantees, if the corresponding graph is acyclic.

In what follows, we define only such ILs as are especially attractive for the mobile broadcasting environment, where clients run data-dissemination applications forced to read (nearly) up-to-date database objects and are expected to be rarely disconnected from the server. Based on research done on real-time transactions [1, 63], we divide data currency requirements into three categories: transactions with (a) strong, (b) firm, and (c) weak requirements.

Definition 20 (Strong Data Currency Requirements). We say that a read-only transaction Ti has strong data currency requirements, if it needs to read committed data that is (still) up-to-date by Ti’s commit time. Since all read operations of Ti must be valid at the end of the transaction’s execution, we also say that Ti runs with End of Transaction (EOT) data currency guarantees. 

Note that the EOT data currency property requires only that writes of committed read-write trans- actions must not interfere with operations of read-only transactions, i.e., object updates of uncom- mitted transactions are not considered by that property.

The firm currency requirement, in turn, provides slightly weaker currency guarantees.

Definition 21 (Firm Data Currency Requirements). We say that a read-only transaction Ti has firm data currency requirements, if it needs to observe committed data that is at least as recent as of the time Ti started its execution. 

Firm currency requirements are attractive for the processing of read-only transactions in mobile broadcasting environments for mainly two reasons: (a) First, and most importantly from the data currency perspective, they guarantee that read-only transactions observe up-to-date or nearly up-to- date data objects, which is an important criterion for data-dissemination applications such as news and sports tickers, stock market monitors, traffic and parking information systems, etc. (b) Second, and contrary to the strong currency requirements, they can easily and instantaneously be validated at the client site without any communication with the server.

For some mobile database applications, however, weak data currency requirements may suffice.

Definition 22 (Weak Data Currency Requirements). We say that a read-only transaction Ti has weak data currency requirements, if it sees a database state the way it existed at some point in time before its actual starting point. 

Despite the unquestionable attractiveness of weaker currency requirements, especially for applications running on clients with frequent disconnections, we believe that the majority of data-dissemination applications require firm currency guarantees, which is supported by the literature [145, 163]. Thus, in this thesis we focus on firm currency guarantees and leave the extension of known ILs by strong and weak data currency requirements for future work. Prior to specifying a new IL that provides serializability along with firm data currency guarantees, we first introduce some additional concepts.

As defined so far, a multi-version history MVH consists of two components: (a) a partial order of database events (<_MVH) and (b) a total order of object versions (≪). Now, we extend the definition of a multi-version history by specifying for each read-only transaction a start time order that relates its starting point to the commit time of previously terminated read-write transactions. The association of a start time order with a multi-version history was first introduced in the context of the Snapshot Isolation (SI) level [23] to provide more flexibility for implementations. According to the SI concept, the database system is free to choose a starting point for a transaction as long as the selected starting point is some (logical) time before its first read operation. Allowing the system to choose a transaction’s starting point without any restrictions is inappropriate in situations where the user expects to read from a database state that existed at some time close to the transaction’s actual starting point. Thus, for applications/transactions to work correctly, the database system needs to select a transaction’s starting point in accordance with the order of events in MVH. We now formally define the concept of a start time order.

Definition 23 (Start Time Order). A start time order of a multi-version history MVH over a set of committed transactions T = {T0, T1, ..., Tn} is a partial order (ST, <_ST) such that:

1. ST = ∪_{i=0..n} {ci, bi};

2. ∀Ti ∈ T: bi <_ST ci;

3. if Ti, Tj ∈ T, then either cj <_ST bi, or ci <_ST bj, or neither;

4. if wi[xi], wj[xj] ∈ MVH, xi ≪ xj, and cj <_ST bk, then ci <_ST bk.

According to Condition 1, the start time order relates begin and commit operations of committed transactions in MVH. Condition 2 states that a transaction’s starting point always precedes its commit point. Condition 3 states that a scheduler S has three possibilities for ordering the start and commit points of any committed transactions Ti and Tj in MVH: S may choose Ti’s (Tj’s) starting point after Tj’s (Ti’s) commit point or, if both transactions are concurrent, neither starts its execution after the other transaction has committed. Condition 4, finally, specifies that if S chooses Tk’s starting point after Tj’s commit point and Tj overwrites the object installed by Ti, then Ti’s commit point must precede Tk’s starting point in any start time order.

For notational convenience, in what follows, we do not specify a start time order for all committed transactions in MVH. Instead, we only associate with each MVH the start time order between read-only and read-write transactions. After laying these foundations, we are ready to define the begin-of-transaction (BOT) data currency property required for the definition of the BOT Serializability IL.

Definition 24 (BOT Data Currency). A read-only transaction Ti possesses BOT data currency guarantees if for all read operations invoked by Ti the following invariants hold:

1. If the pair wj[xj] and ri[xj] is in MVH, then cj <_ST bi.

2. If there is another write operation wk[xk] of a committed transaction Tk in MVH, then either

(a) ck <_ST bi and xk ≪ xj, or

(b) bi <_ST ck.

Condition 1 and Condition 2 ensure that all read operations performed by a read-only transaction Ti are from the snapshot of committed data values valid as of Ti’s starting point. Note that we ignore transaction aborts in our definition of BOT data currency since subsequent definitions that incorporate this criterion only consider MVHs of committed transactions. On the basis of the BOT data currency property, the serializability IL can be extended as follows.
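Operationally, BOT data currency pins every read of Ti to the newest version that was committed before Ti’s begin event. A minimal sketch of this version-selection step (the function name and the numeric timestamps are our own illustration, not part of the thesis):

```python
def bot_version(versions, b_i):
    """Pick the version a BOT-consistent read must return.

    `versions` maps a version id to the commit time c_j of its writer;
    the chosen version is the latest one committed before Ti began (b_i).
    """
    committed_before_start = {v: c for v, c in versions.items() if c < b_i}
    return max(committed_before_start, key=committed_before_start.get)

# Object x with versions x0 (committed at time 10) and x2 (committed at 25);
# a read-only transaction that started at b_i = 20 must observe x0.
assert bot_version({"x0": 10, "x2": 25}, b_i=20) == "x0"
```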

Definition 25 (BOT Serializability). A multi-version history MVH over a set of read-only and read-write transactions is BOT serializable, if MVH is serializable in the sense that the projection of MVH onto all committed transactions in MVH is equivalent to some serial history and the BOT data currency property holds for all read-only transactions in MVH. 

Note that we do not explicitly define data currency guarantees for read operations of read-write transactions at this stage, but we make up for it in Chapter 6. To determine if a given multi- version history MVH satisfies the requirements of the BOT Serializability IL, we use a variation of the MVSG, called start time multi-version serialization graph (ST-MVSG), which is defined as follows:

Definition 26 (Start Time Multi-Version Serialization Graph). Let MVH denote a history over a set of read-only and read-write transactions T = {T0, T1, ..., Tn} and let commit(MVH) represent a function that returns the committed transactions of MVH. A start time multi-version serialization graph for history MVH, denoted ST-MVSG(MVH), is a directed graph with nodes N = commit(MVH) and labeled edges E such that:

1. There is an edge Ti → Tj (Ti ≠ Tj) if Tj ?-depends on Ti;

2. There is an edge Ti → Tj (Ti ≠ Tj) whenever there exists a set of operations {ri[xj], wj[xj], wk[xk]} such that either wj ≪ wk and ck <_ST bi, or bi <_ST cj.

Theorem 1. Let MVH be a multi-version history over a set of committed transactions

T = {T0,T1,...,Tn}. Then, MVH is executed under BOT Serializability, if ST-MVSG(MVH) is acyclic.

Proof. We prove Theorem 1 by contraposition, i.e., we show that if any of the conditions of Definition 25 is violated, then ST-MVSG(MVH) contains a cycle. Suppose first that MVH is a non-serializable multi-version history, i.e., there exists a cyclic dependency ⟨T0, T1, ..., Tk, T0⟩ such that T1 ?-depends on T0, ..., Tk ?-depends on Tk−1, and T0 ?-depends on Tk. By Definition 26, there is an edge Ti → Tj in ST-MVSG(MVH) whenever Tj ?-depends on Ti. Thus, ST-MVSG(MVH) contains a cycle whenever MVH is non-serializable. Next, suppose that the BOT data currency property is violated, i.e., there exists a pair of operations ri[xj] and wj[xj] such that either wj ≪ wk and ck <_ST bi for some write operation wk[xk] of a committed transaction Tk, or bi <_ST cj. In either case, Condition 2 of Definition 26 yields an edge Ti → Tj, and since Ti reads xj from Tj, there is also a wr-dependency edge Tj → Ti. Hence, ST-MVSG(MVH) contains a cycle.

4.3.3 Strict Forward BOT Serializability

The currency requirements of the BOT Serializability IL may not be ideally suited for processing read-only transactions in mobile broadcasting environments for at least two reasons: (a) First, mobile read-only transactions are mostly long-running in nature due to such factors as interactive data usage, intentional or accidental disconnections, and/or high communication delays. Therefore, disallowing a long-lived read-only transaction to see object versions that were created by committed read-write transactions after its starting point might be too restrictive. (b) Another reason for allowing read-only transactions to read “forward” beyond their starting points is related to version management costs. Reading from a snapshot of the database that existed at the time when a read-only transaction started its execution can be expensive in terms of storage costs. If database objects are frequently updated, which is a reasonable assumption for data-dissemination environments [41], multiple previous object versions have to be retained in various parts of the client-server system architecture. Allowing read-only transactions to view more recent data than permitted by the BOT data currency property is efficient, since it enables out-of-date objects to be purged sooner, thus allowing more recent objects to be kept in the database system. An IL that provides such currency guarantees while still enforcing degree 3 data consistency is called Strict Forward BOT Serializability. Prior to defining this IL, we formulate a rule that is sufficient and practicable for determining whether a read-only transaction Ti may be allowed to see the (complete) effects of an update transaction that committed after Ti’s starting point without violating serializability requirements.

Read Rule 1 (Serializable Forward Reads). Let Ti denote a read-only transaction that wants to observe the complete effects of an update transaction Tj that committed after Ti’s starting point as long as the serializability requirement holds. Further, let T_before^i denote the set of read-write transactions that committed before Ti’s starting point and let T_after^i represent the set of read-write transactions that committed after Ti’s starting point, but before the commit point of Tj, and whose effects have not been seen by Ti, i.e., ∀k (Tk ∈ T_after^i): (bi <_ST ck) ∧ (ck <_MVH cj) ∧ ¬(Ti →sfr Tk). Then Ti may observe the effects of Tj if the following condition holds:

1. The invariant ReadSet(Ti) ∩ WriteSet(T_after^i ∪ Tj) = ∅ is true, i.e., the intersection of the actual read set of Ti and the write set of all read-write transactions that committed between Ti’s starting point and Tj’s commit point (including Tj itself) must be an empty set.

Otherwise, Ti is forced to observe the database state that was valid as of the time Ti started, i.e., Ti is obliged to read the most recent version of an object produced by a read-write transaction in T_before^i. In what follows, we denote the fact that Ti is permitted to read forward on the object versions produced by Tj by Ti →sfr Tj.

Note that in Read Rule 1 the read set and the write set refer to data objects and not to their dedicated versions. This will be the case throughout the chapter if not otherwise specified.

An example illustrating how the invariants of Read Rule 1 are applied to decide whether a read- only transaction Ti can safely observe the effects of update transactions that committed after its starting point is as follows:

Example 4.

MVH2 = b0 w0[x0] w0[y0] w0[z0] c0 b1 r1[y0] b2 r2[x0] w2[z2] b3 c2 b4 r4[x0] r3[z2] w3[y3] c3
w4[x4] c4    [x0 ≪ x4, y0 ≪ y3, z0 ≪ z2, c0 <_ST b1]

In MVH2, T0 blindly writes the objects x, y, and z. After T0’s commit point, the read-only transaction T1 starts running and reads the previously initialized value of y. T2 subsequently observes the value of x and produces a new version of z, which is, in turn, read by T3. In the meantime, T4 is started and accesses object x. Thereafter, T3 creates a new version of object y. Finally, T4 updates the initialized value of x and commits. Now suppose transaction T1 wants to read object z and thereafter object x. If we assume that versions {x0, x4} and {z0, z2} are maintained in the database by the time T1’s read requests arrive, the scheduler has to decide which version of the objects z and x T1 can safely observe. If T1 runs at the BOT Serializability IL, the scheduler’s decision is straightforward since T1 needs to access the most recent object versions that existed by its starting point. In this case, T1 would have to read the versions created by T0. However, if the underlying IL requires that T1 should see the updates of transactions that committed after its BOT point as long as the serializability criterion is not violated, the scheduler has to check for every object T1 intends to read whether there exists any committed object version that was installed after T1’s starting point and, if so, whether Read Rule 1 is satisfied. With regard to objects x and z, the reader can easily see from MVH2 that both objects were updated after T1’s BOT point. Hence, the scheduler has to verify for both recently created object versions whether the invariant ReadSet(T1) ∩ WriteSet(T_after^1 ∪ T2) = ∅ and/or ReadSet(T1) ∩ WriteSet(T_after^1 ∪ T4) = ∅ holds. Object z is requested first and therefore the scheduler intersects the current read set of T1 (ReadSet(T1) = {y}) with the write set of all transactions that committed between T1’s BOT point and the commit point of T2, which installed the latest version of z (WriteSet(T_after^1 ∪ T2) = {z}). Since the result of the intersection is the empty set, the scheduler allows T1 to read the most up-to-date version of z. Now we repeat the same procedure for object x. This time, however, the read set of T1 consists of two objects (ReadSet(T1) = {y, z}) and the write set of the transactions that committed between T1’s start point and T4’s commit point also comprises two objects (WriteSet(T_after^1 ∪ T4) = {x, y}). Since the intersection of the read and write sets is non-empty, T1 is not allowed to read “forward” on object x as Read Rule 1 would otherwise be violated. Therefore, T1 is forced to observe the version of x that existed by its BOT point, namely x0.
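The two scheduler decisions of Example 4 can be replayed with a small sketch of Read Rule 1’s intersection test (the function and its parameter names are our own illustration):

```python
def may_read_forward(read_set, after_write_sets, tj_write_set):
    """True iff ReadSet(Ti) ∩ WriteSet(T_after^i ∪ Tj) is empty."""
    blocked = set().union(*after_write_sets, set(tj_write_set))
    return read_set.isdisjoint(blocked)

# First request (object z): ReadSet(T1) = {y}, T_after^1 is empty,
# Tj = T2 with WriteSet(T2) = {z}  ->  forward read on z2 is allowed.
assert may_read_forward({"y"}, [], {"z"})

# Second request (object x): ReadSet(T1) = {y, z}, T_after^1 = {T3} with
# WriteSet(T3) = {y}, Tj = T4 with WriteSet(T4) = {x}
# -> the intersection contains y, so T1 must fall back to x0.
assert not may_read_forward({"y", "z"}, [{"y"}], {"x"})
```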

It can be shown that Read Rule 1 produces only correct read-only transactions in the sense that they are serializable w.r.t. all committed update transactions and all other committed read-only transactions in a multi-version history MVH:

Theorem 2. In a multi-version history MVH that contains a set of read-write transactions Tupdate such that all transactions in Tupdate are serializable, each read-only transaction Ti satisfying Read Rule 1 is serializable w.r.t. Tupdate as well.

Proof. Let T_before^i denote the set of read-write transactions that committed before Ti’s starting point and let T_after^i represent the set of read-write transactions that committed after Ti’s starting point and whose effects have not been seen by Ti. Additionally, let T_forward^i denote the set of read-write transactions that committed after Ti’s starting point, but whose effects have been observed by Ti.

Suppose, by way of contradiction, that MVH contains a cycle ⟨Ti → Tj0 → ... → Tjn → Ti⟩, where Ti is a read-only transaction and Tjn with n ≥ 0 is either a read-only or a read-write transaction. By assumption, the set of read-write transactions Tupdate = T_before^i ∪ T_after^i ∪ T_forward^i in MVH is serializable; thus the cycle can only occur if at least one read-only transaction is part of it.

For now suppose the existence of a so-called single-query cycle, i.e., Ti is the only read-only transaction involved in it. Then, in order for the cycle to be formed, Ti requires both an incoming and an outgoing edge. Since Ti is a read-only transaction which, by definition, performs read operations only, the outgoing edge Ti → Tj0 leads to a read-write transaction Tj0 that wrote the successor version of an object read by Ti, and the incoming edge Tjn → Ti comes from a read-write transaction Tjn that installed an object version read by Ti. By Read Rule 1, Ti is guaranteed to see either all or none of the effects of any read-write transaction and thus, the outgoing edge Ti → Tj0 and the incoming edge Tjn → Ti must involve two distinct read-write transactions, i.e., Tj0 ≠ Tjn. A further prerequisite for a cycle to occur is that Tjn ?-depends or *-depends on Tj0. According to Definition 11 and the specified version and commit order constraints, such a dependency implies that Tj0 committed before Tjn in MVH, i.e., cj0 <_MVH cjn. However, in this case Read Rule 1 would be violated since it forces Ti to see the complete effects of Tj0 whenever the condition ReadSet(Ti) ∩ WriteSet(T_after^i ∪ Tj0) = ∅ holds. Since Tjn ?-depends or *-depends on Tj0 and Tj0 committed before Tjn in MVH, the condition ReadSet(Ti) ∩ WriteSet(T_after^i ∪ Tj0) = ∅ holds whenever ReadSet(Ti) ∩ WriteSet(T_after^i ∪ Tjn) = ∅ is true. Consequently, if Ti is allowed to see the effects of Tjn, it is also allowed to observe the effects of Tj0 and therefore, Tj0 cannot rw-depend on Ti, which is a prerequisite for the cycle to occur.

Now let us assume that a multi-query cycle exists. Again, for such a cycle to be produced, Ti needs to have at least one incoming and one outgoing edge to two distinct read-write transactions Tj0 and Tjn. However, and in contrast to the previous case, Tjn does not need to ?-depend or *-depend on Tj0 any more. Now, in order for a cycle to occur, it suffices that another read-only transaction Tjm with 0 < m < n (directly or indirectly) wr-depends on Tj0 and Tjn (directly or indirectly) rw-depends on Tjm, i.e., the cycle may have the form ⟨Ti → Tj0 → ... → Tjm → ... → Tjn → Ti⟩, where m > 0 and n > m. In order for Tj0 to rw-depend on Ti, Tj0 needs to be a member of T_after^i and the condition ReadSet(Ti) ∩ WriteSet(T_after^i ∪ Tj0) ≠ ∅ needs to be true. Similarly, for Tjn to rw-depend on Tjm, Tjn needs to be part of T_after^jm and the condition ReadSet(Tjm) ∩ WriteSet(T_after^jm ∪ Tjn) ≠ ∅ needs to hold. Now suppose, without loss of generality, that Tjn committed before Tj0 in MVH, i.e., cjn <_MVH cj0. This clearly contradicts the condition ReadSet(Tjm) ∩ WriteSet(T_after^jm ∪ Tj0) = ∅ of Read Rule 1, since it can only be fulfilled if there exists no other read-write transaction Tjk ∈ T_after^jm that committed before Tj0 in MVH and on whose effects Tjm is not allowed to read “forward”. Therefore, if a scheduler operates according to Read Rule 1, single- and multi-query cycles cannot occur in MVH.

The following new IL incorporates the Serializable Forward Reads property, and is defined as follows:

Definition 27 (Strict Forward BOT Serializability). A multi-version history MVH over a set of read-only and read-write transactions is a strict forward BOT serializable history, if all of the following conditions hold:

1. MVH is serializable, and

2. if the pair ri[xj] and wj[xj] of a read-only transaction Ti and a read-write transaction Tj is in MVH, then either:

(a) cj <_ST bi and there is no other write operation wk[xk] of a committed transaction Tk in MVH such that ck <_ST bi and xj ≪ xk, or

(b) bi <_ST cj, Ti →sfr Tj, and there is no other write operation wk[xk] of a committed transaction Tk in MVH such that xj ≪ xk, ck <_MVH ri[xj], and Ti →sfr Tk.

To check whether a given history MVH is strict forward BOT serializable, we again use a variant of the MVSG, called strict forward read multi-version serialization graph, which is defined as follows:

Definition 28 (Strict Forward Read Multi-Version Serialization Graph). A strict forward read multi-version serialization graph for a multi-version history MVH, denoted SFR-MVSG(MVH), is a directed graph with nodes N = P(MVH), where P(MVH) denotes the committed projection of MVH, and labeled edges E such that:

1. There is an edge Ti → Tj (Ti ≠ Tj), if Tj ?-depends on Ti.

2. There is an edge Ti → Tj (Ti ≠ Tj), whenever there exists a pair of operations ri[xj] and wj[xj] of a read-only transaction Ti and a read-write transaction Tj such that wj ≪ wk and ck <_ST bi for some write operation wk[xk] of a committed transaction Tk.

3. There is an edge Ti → Tj (Ti ≠ Tj), whenever there exists a pair of operations ri[xj] and wj[xj] of a read-only transaction Ti and a read-write transaction Tj such that bi <_ST cj and either ¬(Ti →sfr Tj), or there is another write operation wk[xk] of a committed transaction Tk in MVH such that xj ≪ xk, ck <_MVH ri[xj], and Ti →sfr Tk.

Theorem 3. A history MVH consisting of committed read-only and read-write transactions executes under Strict Forward BOT Serializability, if SFR-MVSG(MVH) is acyclic.

Proof. We show that the contrapositive of Theorem 3 holds. If Property 1 of Definition 27 is violated, then MVH is a non-serializable history and SFR-MVSG(MVH) contains a cycle according to Point 1 of Definition 28. Now suppose that Requirement 2 of Definition 27 does not hold. That is, there exists a pair of operations ri[xj] and wj[xj] such that either:

1. cj <_ST bi and there exists a write operation wk[xk] of a committed transaction Tk in MVH such that ck <_ST bi and xj ≪ xk; or

2. bi <_ST cj and ¬(Ti →sfr Tj); or

3. bi <_ST cj and there exists another write operation wk[xk] of a committed transaction Tk in MVH such that xj ≪ xk, ck <_MVH ri[xj], and Ti →sfr Tk.

In the first case, there is an edge Ti → Tj according to Point 2 of Definition 28 and an edge Tj → Ti since Ti write-read depends on Tj. Thus, SFR-MVSG(MVH) contains a cycle. In the second case, there is again an edge Tj → Ti since Ti write-read depends on Tj. By assumption, the property ¬(Ti →sfr Tj) holds, which, in turn, requires that there is another read-write transaction Tk in MVH that produced some object version xk that has been seen by Ti and whose (direct or indirect) successor version was installed by Tj or, alternatively, that Tj ?-depends or *-depends on some read-write transaction Tl (k ≠ l) whose effects Ti is also forbidden to see, i.e., ¬(Ti →sfr Tl). This implies that Tj (directly or indirectly) rw-depends on Ti or that Tl (directly or indirectly) rw-depends on Ti. Thus, there is either an edge Ti → Tj or a chain of edges from Ti to Tj through Tl according to Point 1 of Definition 28, and SFR-MVSG(MVH) is cyclic. In the last case, there is an edge Ti → Tj from Point 3 of Definition 28 and an edge Tj → Ti since Ti directly reads from Tj. Again, SFR-MVSG(MVH) contains a cycle.

4.3.4 Update Serializability

While the strictness of serializability may be necessary for some read-only transactions, the use of such a strong criterion is often overly restrictive and may negatively affect the overall system performance. Even worse, serializability not only trades performance for consistency, but it also has an impact on data currency. Such drawbacks can be eliminated or at least diminished by allowing read-only transactions to be executed at weaker ILs. Various correctness criteria have been proposed in the literature to achieve performance benefits by allowing non-serializable execution of read-only transactions. While some forms of consistency such as Update Serializability/Weak Consistency [29, 61, 162] or External Consistency/Update Consistency [29, 159] require that read-only transactions observe a consistent database state, others such as Epsilon Serializability [88] allow them to view transaction-inconsistent data. We believe that the majority of read-only transactions need to see a transaction-consistent database state and therefore we focus solely on ILs that provide such guarantees. An IL that is strictly weaker than serializability and allows read-only transactions to see a transaction-consistent state is the Update Serializability (US) level, which can be formally defined as follows:

Definition 29 (Update Serializability). Let us denote the set of committed read-write transactions by Tupdate = {T0,T1,...,Tn} and the projection of MVH onto Tupdate by P(MVH, Tupdate). A multi- version history MVH over a set of read-only and read-write transactions is an update serializable history, if for each read-only transaction Ti in MVH the subhistory P(MVH,Tupdate) ∪ P(MVH,Ti) is serializable. If there are no read-only transactions in MVH, then only the subhistory

P(MVH,Tupdate) needs to be serializable. 

Update Serializability differs from the Serializability IL by allowing read-only transactions to serialize individually with the set of committed read-write transactions in a multi-version history

MVH, i.e., it relaxes the strictness of the serializability criterion by requiring that read-only trans- actions are serializable w.r.t. committed read-write transactions, but not w.r.t. other committed read-only transactions.
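Definition 29 suggests a direct mechanical test: check the MVSG of the update subhistory once, then re-check it once per read-only transaction with only that query’s edges added. The following sketch is our own formulation on toy graphs (not MVH3); it relies on the standard-library `graphlib` module for cycle detection.

```python
from graphlib import TopologicalSorter, CycleError

def acyclic(edges):
    """True iff the digraph {node: set(neighbors)} has no cycle."""
    try:
        list(TopologicalSorter(edges).static_order())
        return True
    except CycleError:
        return False

def update_serializable(update_edges, query_edges, queries):
    """`update_edges`: MVSG over the read-write transactions only;
    `query_edges[q]`: (a, b) edge pairs incident to read-only transaction q.
    Each query must serialize with the updaters, but not with other queries."""
    if not acyclic(update_edges):
        return False                       # updaters themselves not serializable
    for q in queries:
        g = {n: set(s) for n, s in update_edges.items()}
        for a, b in query_edges.get(q, ()):
            g.setdefault(a, set()).add(b)  # splice in only this query's edges
            g.setdefault(b, set())
        if not acyclic(g):
            return False
    return True

# Two independent updaters; query A serializes them as U1 < A < U2,
# query B as U2 < B < U1. Each query alone is fine (update serializable),
# although merging all edges would yield the cycle U1 -> A -> U2 -> B -> U1.
upd = {"U1": set(), "U2": set()}
qedges = {"A": [("U1", "A"), ("A", "U2")],
          "B": [("U2", "B"), ("B", "U1")]}
assert update_serializable(upd, qedges, ["A", "B"])
```

This mirrors the discussion above: the history is update serializable even though the two read-only transactions perceive opposite serial orders of the read-write transactions.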

4.3.5 Strict Forward BOT Update Serializability

Update Serializability as defined above allows different read-only transactions to view different transaction-consistent database states that result from different serialization orders of read-write transactions. By not requiring that all read-only transactions see the same consistent state, more concurrency between read-only and read-write transactions is possible. However, higher transaction throughput obtained by relaxing the consistency requirement must not be achieved at the cost of providing no or unacceptable data currency guarantees to users. It is obvious that Update Serializability lacks any currency requirements; thus we need to extend the Update Serializability IL by incorporating such guarantees.

To illustrate this requirement, we propose the following example history that has been produced by our flight scheduling system:

Example 5.

MVH3 = b0 w0[x0, 2:40 pm] b1 r1[z0, cloudy] w0[y0, 2:50 pm] c0 w1[z1, blizzard] c1 b2
r2[z1, blizzard] b3 r3[z1, blizzard] r2[x0, 2:40 pm] w2[x2, 2:50 pm] c2 r3[y0, 2:50 pm]
b4 r4[x0, 2:40 pm] w3[y3, 3:00 pm] c3 b5 r5[y0, 2:50 pm] b6 r6[z1, blizzard]
r6[x2, 2:50 pm] w6[x6, 3:00 pm] c6 b7 r7[z1, blizzard] r7[y3, 3:00 pm] w7[y7, 3:10 pm]
c7 r5[x6, 3:00 pm] c5 r4[y7, 3:10 pm] c4
[x0 ≪ x2 ≪ x6, y0 ≪ y3 ≪ y7, z0 ≪ z1, c2 <_ST b4, c3 <_ST b5]

History MVH3 represents an extension of MVH1 since it contains two additional updates of the departure times of the flights x and y. As in Example 3, the take-off times of flights x and y need to be delayed due to an imminent change in local weather conditions. The first amendment of the flight schedule is performed by transactions T2 and T3. Since weather conditions are not going to improve in the foreseeable future, both flights need to be rescheduled repeatedly, which is carried out by transactions T6 and T7. Between both modifications of the departure times, two employees of the airport personnel are asked by passengers to query the flight scheduling system to get the latest data on the status of flights x and y. To improve application response time, this time both co-located employees initiate their read-only transactions T4 and T5 with Update Serializability guarantees. At the transactions’ start time, both employees are disconnected from the central flight scheduling system due to their unfavorable position with regard to the access points of the wireless LAN at the airport. However, despite being disconnected, both read-only transactions start their operations since their first requested objects (x and y, respectively) are located in the memory of their PDAs. Since the other requested object is not cache-resident, both clients need to wait to be reconnected to proceed with transaction processing. By the time both clients are reconnected, both flights have been delayed repeatedly, and transactions T4 and T5 read the latest data on flights y and x, respectively.

To illustrate that the scheduler has produced a correct schedule, Figure 4.2 shows the MVSG of history MVH3. It is easy to see that MVH3 is an update serializable history since the graph’s cycle ⟨T5 → T3 → T7 → T4 → T2 → T6 → T5⟩ can be eliminated by removing either T4 or T5 from MVSG(MVH3). Although both read-only transactions are processed in compliance with the Update Serializability requirements, it is easy to imagine that the produced query results are undesirable since they may be confusing to the database users, especially if they communicate with each other in order to share the information obtained. Again, this example provides evidence that conventional isolation levels need to be redefined or extended in order to be appropriate for read-only transactions with data currency constraints.

Figure 4.2: Multi-version serialization graph of MVH3.

Implications on transaction correctness due to reads from out-of-date objects as shown in Example 5 can be diminished by adding data currency guarantees to the definition of Update Serializability. As data currency and consistency are orthogonal concepts, it is possible to combine Update Serializability with various types of currency. As before, we concentrate on the BOT data currency type, since applications frequently require the values of the disseminated objects to be up-to-date or at least “close” to the most current values [41, 145, 163]. Actually, there is no need to define a new IL that provides BOT data currency guarantees in combination with Update Serializability correctness since such a level would produce scheduling results that are consistent with the results produced by the already defined BOT Serializability degree. Nevertheless, extending Update Serializability by the requirement that a read-only transaction Ti must perceive the most recent version of committed objects that existed by Ti’s starting point or thereafter seems to be a valuable property in terms of currency and performance. However, forward reads beyond Ti’s start point should only be allowed if the Update Serializability criterion is not violated. In order to determine whether a read-only transaction Ti can safely read “forward” on some object version x that has been created by a committed read-write transaction Tj after its starting point, the following property can be used:

Read Rule 2 (Update Serializable Forward Reads). Let Ti denote a read-only transaction in a multi-version history MVH that is required to observe the (complete) effects of a read-write transaction Tj that committed after Ti’s starting point as long as the Update Serializability requirements are not violated. Again, let T^i_before denote the set of read-write transactions that committed before Ti’s starting point and let T^i_after represent the set of read-write transactions that committed after Ti’s starting point, but before the commit point of Tj, and whose effects have not been seen by Ti, i.e., ∀k (Tk ∈ T^i_after): (bi <MVH ck) ∧ (ck <MVH cj). Then Ti is allowed to read “forward” and observe the effects of Tj in either of the following cases:

1. If the invariant ReadSet(Ti) ∩ WriteSet(T^i_after ∪ {Tj}) = ∅ holds or

2. If the condition ReadSet(Ti) ∩ WriteSet(Tj) = ∅ holds and there is no read-write transaction Tk in MVH (j ≠ k, i ≠ k) such that bi <MVH ck <MVH cj and Tj ?-depends or *-depends on Tk.

Otherwise, Ti is forced to observe the database state that was valid at its start time, i.e., Ti is obliged to read the most up-to-date version of an object produced by a read-write transaction in T^i_before. In what follows, we represent the fact that Ti is allowed to read “forward” to observe the effects of Tj by Ti →usfr Tj.
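As an illustration, the two conditions of Read Rule 2 amount to simple membership tests over read and write sets. The sketch below (Python) makes this concrete; all container shapes, and the flattened `tj_depends` argument summarizing Tj’s ?- and *-dependencies, are hypothetical simplifications rather than part of the formal model:

```python
def usfr_allowed(ti_readset, tj, t_after, writesets, tj_depends):
    """Read Rule 2 sketch: may read-only transaction Ti read 'forward'
    and observe the effects of committed read-write transaction tj?

    ti_readset -- object ids read by Ti so far
    tj         -- id of the candidate forward-read transaction
    t_after    -- ids of transactions committed after Ti's start but
                  before tj's commit, whose effects Ti has not seen
    writesets  -- transaction id -> set of object ids it wrote
    tj_depends -- ids of transactions tj ?-depends or *-depends on
    """
    # Condition 1: ReadSet(Ti) does not intersect WriteSet(T_after + {tj}).
    written = set(writesets[tj])
    for tk in t_after:
        written |= writesets[tk]
    if not (ti_readset & written):
        return True
    # Condition 2: ReadSet(Ti) misses WriteSet(tj) and tj does not
    # ?- or *-depend on any unseen transaction committed in between.
    return (not (ti_readset & writesets[tj])
            and not (set(t_after) & set(tj_depends)))
```

Condition 1 lets Ti serialize after all of T^i_after and Tj; condition 2 lets Ti serialize after Tj but before the skipped transactions, which is only sound if Tj does not depend on them.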

As before, it can be shown that Read Rule 2 produces only correct histories in the sense that each read-only transaction sees a serial order of all committed read-write transactions in a multi-version history MVH.

Theorem 4. In a multi-version history MVH that contains a set of read-write transactions Tupdate such that all read-write transactions in Tupdate are serializable, each read-only transaction Ti satisfying Read Rule 2 is update serializable w.r.t. Tupdate as well.

Proof. Again, let T^i_after represent the set of read-write transactions that committed after Ti’s starting point and whose effects have not been seen by Ti, and let T^i_forward denote the set of read-write transactions that committed after Ti’s starting point, but whose effects have been observed by Ti. By Definition 29, a read-only transaction Ti is said to be update serializable if it observes a serializable database state; unlike with serializability, however, the serial ordering observed by Ti could differ from that observed by other read-only transactions. As a result, Update Serializability permits multi-query cycles involving multiple read-only transactions and one or more read-write transactions, but prohibits single-query cycles involving a single read-only transaction and one or more read-write transactions. Now suppose, by way of contradiction, that MVH contains a single-query cycle

⟨Ti → Tj0 → ... → Tjn → Ti⟩, where Ti is a read-only transaction and Tjn with n ≥ 0 is a read-write transaction of Tupdate. By assumption, the set of read-write transactions Tupdate in MVH is serializable; thus, the cycle can only occur between a single read-only transaction and the read-write transactions in Tupdate. In order for the above-described cycle to be produced, Ti must have both an incoming and an outgoing edge. Thereby, the outgoing edge Ti → Tj0 points to a read-write transaction Tj0 that wrote the successor version of an object read by Ti, and the incoming edge Tjn → Ti originates from a read-write transaction Tjn that installed an object version read by Ti.

By Read Rule 2, Ti is guaranteed to either miss the effects or observe the complete effects of any read-write transaction; thus, the outgoing edge Ti → Tj0 and the incoming edge Tjn → Ti must involve two distinct read-write transactions, i.e., Tj0 ≠ Tjn. A further prerequisite for the cycle to occur is that Tjn ?-depends or *-depends on Tj0. Now suppose (in addition to Tj0 rw-depending on Ti and Ti wr-depending on Tjn) that Tjn ?-depends or *-depends on Tj0. This implies, according to Definition 11 and the additionally specified version and commit order constraints, that Tj0 committed before Tjn, i.e., cj0 <MVH cjn. But then Read Rule 2 would be violated, since it allows Ti to read “forward” and observe the effects of Tjn only if there is no read-write transaction Tj0 in MVH such that bi <MVH cj0 <MVH cjn and Tjn ?-depends or *-depends on Tj0. Consequently, single-query cycles cannot occur in MVH, and each read-only transaction satisfying Read Rule 2 is update serializable w.r.t. Tupdate.

We can now define a new IL that ensures Update Serializability correctness along with strict forward BOT data currency guarantees:

Definition 30 (Strict Forward BOT Update Serializability). A multi-version history MVH over a set of read-only and read-write transactions is strict forward BOT update serializable, if the following conditions hold:

1. MVH is update serializable, and

2. if the pair ri[xj] and wj[xj] of a read-only transaction Ti and a read-write transaction Tj is in MVH, then either

(a) Requirement 2a of Definition 27 is true or

(b) bi <MVH cj, Ti →usfr Tj, and there is no committed transaction Tk in MVH such that xj ≪ xk, ck <MVH ri[xj], and Ti →usfr Tk.

Again, we determine whether a given history MVH is strict forward BOT update serializable by using a directed MVSG:

Definition 31 (Strict Forward Read Single Query Multi-Version Serialization Graph). A strict forward read single query multi-version serialization graph for MVH w.r.t. a read-only transaction Ti, denoted SFR-SQ-MVSG(MVH, Ti), is a directed graph with nodes N = Tupdate ∪ {Ti} and labeled edges E such that:

1. There is an edge Ti → Tj (Ti ≠ Tj), if Tj ?-depends on Ti.

2. There is an edge Ti → Tj (Ti ≠ Tj), whenever there exists a pair of operations ri[xj] and wj[xj] of a read-only transaction Ti and a read-write transaction Tj such that wj[xj] ≪ wk[xk] and ck <MVH bi.

3. There is an edge Ti → Tj (Ti ≠ Tj), whenever there exists a pair of operations ri[xj] and wj[xj] of a read-only transaction Ti and a read-write transaction Tj such that bi <MVH cj and either ¬(Ti →usfr Tj) holds or there exists a committed read-write transaction Tk in MVH such that xj ≪ xk, ck <MVH ri[xj], and Ti →usfr Tk.

Theorem 5. A history MVH consisting of committed read-only and read-write transactions executes under Strict Forward BOT Update Serializability, if for each read-only transaction Ti the corresponding SFR-SQ-MVSG(MVH, Ti) is acyclic.

Proof. We again show that the contrapositive of Theorem 5 holds. If Requirement 1 of Definition 30 is violated, then MVH is not consistent with the Update Serializability criterion and SFR-SQ-MVSG(MVH, Ti) contains a cycle according to Point 1 of Definition 31. Now suppose that Requirement 2 of Definition 30 does not hold. That is, there exists a pair of operations ri[xj] and wj[xj] such that either:

1. cj <MVH bi and there exists a write operation wk[xk] such that ck <MVH bi and xj ≪ xk, or

2. bi <MVH cj and ¬(Ti →usfr Tj), or

3. bi <MVH cj and there exists a committed transaction Tk such that ck <MVH ri[xj], Ti →usfr Tk, xj ≪ xk and Ti →usfr Tj.

In the first case, there is an edge Ti → Tj according to Point 2 of Definition 31 and an edge Tj → Ti since Ti write-read depends on Tj. Thus, SFR-SQ-MVSG(MVH, Ti) contains a cycle. In the second case, there is an edge Tj → Ti since Ti write-read depends on Tj. Further, since the property ¬(Ti →usfr Tj) holds, there is either an edge Ti → Tj because ReadSet(Ti) ∩ WriteSet(T^i_after ∪ {Tj}) ≠ ∅, or an edge from Ti to Tj via some read-write transaction Tk that committed after Ti’s starting point, but before Tj’s commit point, with ¬(Ti →usfr Tk). If the latter holds, then there exists a sequence ⟨Tk → Tl0 → Tl1 → ... → Tln → Tj⟩ (n ≥ 0) of edges in SFR-SQ-MVSG(MVH, Ti), because Tj ?-depends or *-depends on Tk according to Requirement 2 of Read Rule 2. Thus, SFR-SQ-MVSG(MVH, Ti) is cyclic. In the final case, there is an edge Ti → Tj from Point 3 of Definition 31 and an edge Tj → Ti since Ti directly reads from Tj. Again, SFR-SQ-MVSG(MVH, Ti) contains a cycle.
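Since Theorem 5 reduces the correctness test to an acyclicity check of the serialization graph, a scheduler or an offline checker only needs a standard cycle test. A minimal depth-first-search sketch over a hypothetical adjacency-list encoding of the SFR-SQ-MVSG (Python):

```python
def is_acyclic(edges):
    """Return True iff the directed graph has no cycle.

    edges -- dict mapping a node to the list of nodes it has edges to,
             e.g. an SFR-SQ-MVSG built from the rules of Definition 31.
    """
    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited / on DFS path / finished
    color = {}

    def visit(node):
        color[node] = GRAY
        for succ in edges.get(node, ()):
            state = color.get(succ, WHITE)
            if state == GRAY:              # back edge: cycle found
                return False
            if state == WHITE and not visit(succ):
                return False
        color[node] = BLACK
        return True

    for node in list(edges):
        if color.get(node, WHITE) == WHITE and not visit(node):
            return False
    return True

# A single-query cycle Ti -> Tj -> Ti is detected as a violation:
# is_acyclic({"Ti": ["Tj"], "Tj": ["Ti"]}) evaluates to False.
```

The same test applies unchanged to the CD-SFR-SQ-MVSG used for Strict Forward BOT View Consistency later in this section.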

4.3.6 View Consistency

View Consistency (VC) is the weakest IL that ensures transaction consistency to read-only transactions provided that all read-write transactions modifying the database state are serializable. It was first informally defined in the literature by [159] under the name External Consistency. Due to the valuable guarantees it provides to read-only transactions, it appears to be an ideal candidate for use in all kinds of environments, including broadcasting systems. However, as noted for the Full Serializability and Update Serializability degrees, the definition of View Consistency lacks the notion of data currency. We formally define the View Consistency level as follows:

Definition 32 (View Consistency). Let T^i_depend denote the set of committed read-write transactions in MVH that the read-only transaction Ti (directly or indirectly) wr-depends on. A multi-version history MVH over a set of read-only and read-write transactions is view consistent, if all read-write transactions are serializable and, for each read-only transaction Ti in MVH, the subhistory P(MVH, T^i_depend) ∪ P(MVH, Ti) is serializable.

The IL’s attractiveness relates to the fact that all read-write transactions produce a consistent database state and read-only transactions observe a transaction-consistent database state. However, as with Update Serializability, there might be a concern that two read-only transactions executed at the same or different clients may see different serial orders of read-write transactions. A further undesirable property of View Consistency is that it allows read-only transactions to see a database state that would have existed if one or more read-write transactions had never been executed or had been aborted. Additionally, it allows read-only transactions to see a state that might not be consistent with the current state of the database. While the first and second potential problems can be resolved by running read-only transactions with Full Serializability and Update Serializability IL guarantees, respectively, the latter issue can be compensated for by extending the View Consistency IL with appropriate currency guarantees. As for the Update Serializability IL, there is no need to define a new IL that ensures View Consistency correctness in combination with BOT data currency, since such an IL would produce scheduling histories consistent with the previously defined BOT Serializability IL. However, extending the definition of the View Consistency IL with a read “forward” obligation that requires a read-only transaction Ti to see the effects of read-write transactions that committed after its starting point, as long as the requirements of View Consistency correctness are satisfied, appears to be a worthwhile approach. Before we formally define this new IL, we need to formalize a condition that allows us to determine whether a read-only transaction Ti can observe the effects of a read-write transaction Tj that committed its execution after Ti’s starting time.

Read Rule 3 (View Consistent Forward Reads). Let Ti denote a read-only transaction in a multi-version history MVH that is required to observe the (complete) effects of a read-write transaction Tj that committed after Ti’s starting point as long as the View Consistency requirements are not violated. Additionally, let T^i_before denote the set of read-write transactions that committed before Ti’s starting point and let T^i_after represent the set of read-write transactions that committed after Ti’s starting point, but before the commit point of Tj, and whose effects have not been seen by Ti, i.e., ∀k (Tk ∈ T^i_after): (bi <MVH ck) ∧ (ck <MVH cj). Then Ti is allowed to read “forward” and observe the effects of Tj in either of the following cases:

1. If the invariant ReadSet(Ti) ∩ WriteSet(T^i_after ∪ {Tj}) = ∅ holds or

2. If the invariant ReadSet(Ti) ∩ WriteSet(Tj) = ∅ holds and there is no read-write transaction Tk in MVH (j ≠ k, i ≠ k) such that bi <MVH ck <MVH cj and Tj (directly or indirectly) wr- or ww-depends on Tk.

Otherwise, Ti is forced to see the database state as it existed by its starting point, i.e., Ti is obliged to read the most up-to-date version of an object produced by some read-write transaction in T^i_before. In what follows, we represent the fact that Ti is allowed to read “forward” to observe the effects of Tj by Ti →vcfr Tj.

Again, it can be shown that Read Rule 3 produces only syntactically correct histories in the sense that read-only transactions see a transaction-consistent database state.

Theorem 6. In a multi-version history MVH containing a set of read-write transactions Tupdate such that all read-write transactions in Tupdate are serializable, each read-only transaction Ti satisfying Read Rule 3 is serializable w.r.t. all transactions in Tupdate that created object versions that have been (either directly or indirectly) seen by Ti.

Proof. As in previous proofs, let T^i_after represent the set of read-write transactions that committed after Ti’s starting point and whose effects have not been seen by Ti, and let T^i_forward denote the set of read-write transactions that committed after Ti’s starting point, but whose effects have been observed by Ti. According to Definition 32, a read-only transaction Ti is said to be view consistent, if it serializes with the set of update transactions that produced values that are (either directly or indirectly) seen by Ti. Thus, View Consistency permits single-query cycles in the serialization graph as long as they can be broken by removing a rw-edge between two read-write transactions that are involved in the cycle. Note that an rw-edge between two read-write transactions Tj0 and Tj1 is formed by a read operation of Tj0 followed by a conflicting write operation of Tj1, i.e., Tj1 installed the successor of the object version read by Tj0. Now suppose, by way of contradiction, that MVH contains a cycle ⟨Ti → Tj0 → ... → Tjn → Ti⟩, where Ti is a read-only transaction and Tjn with n ≥ 0 is a read-write transaction. Additionally, suppose that the cycle cannot be broken by removing rw-edges between the involved read-write transactions. Given these prerequisites, Ti must have both an incoming and an outgoing edge in order for the cycle to be produced. As in previous proofs, the incoming edge Tjn → Ti is from a read-write transaction Tjn that installed an object version read by Ti, and the outgoing edge Ti → Tj0 is to a read-write transaction Tj0 that wrote the successor version of an object read by Ti. By Read Rule 3, Ti is guaranteed to see either all or none of the effects of any read-write transaction; thus, both edges must involve distinct read-write transactions, i.e., Tj0 ≠ Tjn. A further prerequisite for a cycle to occur is that Tjn (directly or indirectly) wr- or ww-depends on Tj0. According to Definition 11 and the specified version and commit order constraints, this dependency implies that Tj0 committed before Tjn in MVH, i.e., cj0 <MVH cjn. But then Read Rule 3 would be violated, as it allows Ti to read “forward” and observe the effects of Tjn only if there is no read-write transaction Tj0 in MVH such that bi <MVH cj0 <MVH cjn and Tjn write-read or write-write depends on Tj0. Therefore, if a scheduler operates according to Read Rule 3, single-query cycles that cannot be broken by removing rw-edges between the involved read-write transactions cannot occur in MVH. Thus, Read Rule 3 guarantees view consistency to read-only transactions.

We can now define our new IL that ensures View Consistency correctness in addition to strict forward BOT data currency guarantees:

Definition 33 (Strict Forward BOT View Consistency). A multi-version history MVH over a set of read-only and read-write transactions is strict forward BOT view consistent, if the following conditions hold:

1. MVH is view consistent, and

2. if the pair ri[xj] and wj[xj] of a read-only transaction Ti and a read-write transaction Tj is in MVH, then either

(a) Requirement 2a of Definition 27 is true or

(b) bi <MVH cj, Ti →vcfr Tj, and there is no committed transaction Tk in MVH such that xj ≪ xk, ck <MVH ri[xj], and Ti →vcfr Tk.

To show that a multi-version history MVH provides strict forward BOT view consistency guarantees, we associate a corresponding graph with MVH.

Definition 34 (Causal Dependency Strict Forward Read Single Query Multi-Version Serialization Graph). A causal dependency strict forward read single query multi-version serialization graph for a multi-version history MVH w.r.t. a read-only transaction Ti, denoted CD-SFR-SQ-MVSG(MVH, Ti), is a directed graph with nodes N = T^i_depend ∪ {Ti} and labeled edges E such that:

1. There is an edge Ti → Tj (Ti ≠ Tj), if Tj ?-depends on Ti.

2. There is an edge Ti → Tj (Ti ≠ Tj), whenever there exists a pair of operations ri[xj] and wj[xj] of a read-only transaction Ti and a read-write transaction Tj such that wj[xj] ≪ wk[xk] and ck <MVH bi.

3. There is an edge Ti → Tj (Ti ≠ Tj), whenever there exists a pair of operations ri[xj] and wj[xj] of a read-only transaction Ti and a read-write transaction Tj such that bi <MVH cj and either ¬(Ti →vcfr Tj) holds or there exists a committed read-write transaction Tk in MVH such that xj ≪ xk, ck <MVH ri[xj], and Ti →vcfr Tk.

Theorem 7. A history MVH consisting of committed read-only and read-write transactions executes under Strict Forward BOT View Consistency, if for each read-only transaction Ti the corresponding CD-SFR-SQ-MVSG(MVH, Ti) is acyclic.

Proof. The correctness of Theorem 7 can be proved using the same logical reasoning as given in the proof of Theorem 5.

To conclude this subsection, Table 4.1 summarizes the main characteristics of the newly defined ILs.

4.4 Implementation Issues

We now propose four protocols that implement the newly defined ILs in a space- and time-efficient manner. Before doing so, however, we illustrate the key characteristics of the target system environment for which our protocols are designed and present the general design assumptions that underlie the implementations of the ILs.

Data dissemination using both push and pull mechanisms is likely to become the prevailing mode of data exchange in mobile wireless environments. In the previous chapter, we have already elaborated on the fundamental principles and general operating guidelines of hybrid dissemination-based data delivery systems; we therefore restrict the subsequent discussion to the system properties and assumptions that are relevant for the design and performance of our protocols. Central to any dissemination-based system are the contents and structure of the broadcast program. Due to its simplicity, we use a flat broadcast disk system [4] to generate the broadcast program, which consists, in our configuration, of three types of segments: (a) an index segment, (b) a data segment, and (c) a concurrency control information segment. To make the disseminated data self-descriptive, we incorporate an index into the broadcast program. We choose (1,m) indexing [78] as the underlying index organization method and broadcast the complete index once within each MIBC. To provide cache consistency in spite of server updates, each minor cycle is preceded by a concurrency control report or CCR that contains the read and write sets along with the values of newly created objects of read-write transactions that committed in the previous

MIBC. An entry in a CCR is a 4-tuple (TID, ReadSet, WriteSet, WriteSetValues), where TID denotes the globally unique transaction identifier of some recently committed transaction Ti, ReadSet and WriteSet represent Ti’s read and write set, respectively, and WriteSetValues is a list of pairs (xi, v) that maps each object version xi newly created by transaction Ti to its associated value v. Transactions stored in the CCR are ordered by their commit time to ensure their efficient and correct processing by the clients. The data segment contains hot-spot data objects that are of interest to a large number of clients. The rest of the database is assumed to be accessed on demand through a bandwidth-restricted back-channel. Figure 4.3 finally illustrates the basic layout of the broadcast program used in our system model, which corresponds to the structure depicted in Figure 3.1(b).

BOT Serializability (base IL: (Full) Serializability). Consistency guarantees: each read-only transaction in MVH is required to serialize with all committed read-write and all other read-only transactions in MVH. Currency guarantees: read-only transactions are required to observe a snapshot of committed data objects that existed by their starting points.

Strict Forward BOT Serializability (base IL: (Full) Serializability). Consistency guarantees: as for BOT Serializability. Currency guarantees: read-only transactions are required to read from a database snapshot valid as of the time when they started; however, they are forced to read “forward” and observe the updates of read-write transactions that committed after their starting points as long as the serializability requirement is not violated by those reads.

Strict Forward BOT Update Serializability (base IL: Update Serializability [61, 162] / Weak Consistency [29]). Consistency guarantees: each read-only transaction in MVH is required to serialize with all committed update transactions in MVH, but does not need to be serializable with other committed read-only transactions. Currency guarantees: as for Strict Forward BOT Serializability, with the difference that read-only transactions are obliged to issue forward reads as long as the update serializability requirements are not violated by those reads.

Strict Forward BOT View Consistency (base IL: View Consistency / Update Consistency [29] / External Consistency [159]). Consistency guarantees: each committed read-only transaction in MVH is required to serialize with all committed update transactions in MVH that had written values which have (either directly or indirectly) been seen by the read-only transaction. Currency guarantees: as for Strict Forward BOT Serializability, with the difference that forward reads of read-only transactions are enforced whenever the view consistency criterion is not violated by those reads.

Table 4.1: Newly defined ILs and their core characteristics.
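The CCR described above can be modeled as a plain data structure. The sketch below (Python) mirrors the 4-tuple layout from the text; the `commit_ts` field is an assumption added purely to express the commit-time ordering of entries:

```python
from dataclasses import dataclass, field

@dataclass
class CCREntry:
    """One CCR entry: (TID, ReadSet, WriteSet, WriteSetValues)."""
    tid: int                  # globally unique transaction identifier
    commit_ts: int            # commit time, used for ordering (assumed)
    read_set: frozenset       # object ids read by the transaction
    write_set: frozenset      # object ids written by the transaction
    write_set_values: dict    # object id -> newly created value

@dataclass
class CCR:
    """Concurrency control report broadcast at the start of each MIBC."""
    mibc: int
    entries: list = field(default_factory=list)

    def add(self, entry: CCREntry) -> None:
        self.entries.append(entry)
        # Entries are kept ordered by commit time, as required for
        # efficient and correct client-side processing.
        self.entries.sort(key=lambda e: e.commit_ts)
```

A client iterating over `entries` thus replays committed write sets in commit order, which is exactly what the invalidation logic of the MVCC-BS protocol below relies on.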

[Figure: a major broadcast cycle composed of four minor broadcast cycles (Minor BC1–BC4); each minor cycle comprises an index segment, a CCR segment, and a data segment. A CCR carries a CCR ID and a list of entries, each holding a TID, a ReadSet, a WriteSet, and a linked list of (OID, ObjectValue) pairs.]
Figure 4.3: Organization structure of the broadcast program.

With respect to the client and server architecture, we assume a hybrid caching system for both system components to improve the performance of our protocols. In a hybrid caching system, the available cache memory is divided into a page-based segment and an object-based segment. The server uses its page cache to handle fetch requests and to fill the broadcast disk with pages containing hot-spot objects. The server object cache is utilized to save installation disk reads for writing modified objects onto disk. The latter is organized as a mono-version object cache similar to the modified object buffer (MOB) in [53]. With respect to concurrency control, the server object cache can be used to answer object requests in case a transaction-consistent page is not available from the client’s perspective. The client also maintains a hybrid cache scheme to take full advantage of both types. The client page cache is used to keep requested and prefetched database pages in volatile memory. We assume a single-version page cache that maintains up-to-date server pages. The client object cache, on the other hand, is allowed to store multiple versions of an object x. To simplify the description of our protocols, we assume that an object x can either be stored in a page p or in the object cache of the client. To judge the correctness of a client read operation, each page p is assigned a commit timestamp CTS(p) that reflects the (logical) time when the last transaction that updated or newly created an object x in page p committed. Analogous to the page cache, each version of an object maintained in the client object cache is associated with a commit timestamp reflecting the point in time when the version was installed by a committed read-write transaction.

4.4.1 Multi-Version Concurrency Control Protocol with BOT Serializability Guarantees (MVCC-BS)

In this subsection, we present an algorithm that provides BOT Serializability to read-only transactions. To enforce database consistency, we assume that the state of the mobile database is exclusively modified by transactions that run with serializability requirements. We also assume that clients can only execute a single read-only transaction at a time. To avoid mixing up data consistency- and currency-related issues arising from connection failures with the basic working principles of our CC protocols, we assume that mobile clients do not suffer from intermittent connectivity and can always actively observe the broadcast channel. We refer the interested reader to [21] for a detailed description of three cache invalidation methods that could easily be adapted for use along with our subsequently defined CC protocols to prevent mobile clients from observing inconsistent and out-dated data in case of communication failures or voluntary disconnections. Finally, note that the following algorithm forms the foundation for subsequent protocols that ensure weaker semantic guarantees than serializability and should therefore be studied carefully.

Our implementation of the BOT Serializability level allows concurrency control with nearly no overhead. For each read-only transaction Ti, the client keeps the following data structures and information for concurrency control purposes: (a) Ti’s startup timestamp, (b) Ti’s read set, and (c) an object invalidation list (OIL). The latter contains the identifiers and commit timestamps of objects that were created during Ti’s execution time. Note that in order to ensure the correctness of the MVCC-BS protocol, read-only transactions are not required to keep track of their read sets. However, all subsequently defined protocols that are built upon the MVCC-BS protocol do require this information, and we therefore introduce and describe the structure’s operations here. Also note that all underlying data structures of our CC schemes are chosen for clarity of exposition rather than for efficient implementation.

The server data structures include the hybrid server cache, the CCR as described before, and the temporary object cache (TOB). The TOB is used to record the modified or newly created object versions of transactions that committed during the current MIBC. Additionally, the TOB is utilized to store “shadow” versions of transactions that are not yet committed. Whenever an MIBC is finished, all versions of committed transactions are merged from the TOB into the MOB, and the updated or newly created object versions become available for the next MIBC.

Now we describe the protocol scheme by differentiating between client and server operations.

Client Operations

1. Read Object x by Transaction Ti on Client C

(a) Ti issues its first read operation.

Assign the number of the current MIBC to STS(Ti). Add x to Ti’s read set (ReadSet(Ti)).

(b) Requested object x is cache-resident in the page or object cache.

If the requested object is stored in page p, it can be read by Ti whenever p’s commit timestamp CTS(p) is smaller than STS(Ti) or there is no entry of x with commit timestamp CTSOIL(x) in the object invalidation list (OIL) such that STS(Ti) ≤ CTSOIL(x). Otherwise, Ti looks for an entry of object x in the object cache. If some version j of object x is in the object cache, Ti can read xj, provided the invariant CTS(xj) < STS(Ti) holds. Note that there is no need to check whether there is another version k of object x in the client cache such that CTS(xj) < CTS(xk) and CTS(xk) < STS(Ti): since clients run at most a single read-only transaction at any one time, only one version of object x with a commit timestamp smaller than the starting timestamp of the read-only transaction may be useful for it, and only that version is therefore maintained in the client object cache. Other object versions would only waste scarce memory space and are therefore garbage-collected as mentioned below. If Ti reads some version of x, add x to ReadSet(Ti).

(c) Requested object x is scheduled for broadcasting.

Read the index of the broadcast to determine the position of the object on the broadcast. The client is allowed to download the desired object x either if the commit timestamp of the page p in which x resides is smaller than Ti’s starting point, i.e., CTS(p) < STS(Ti), and there is no object version of x in OIL such that CTS(p) < CTSOIL(x) and CTSOIL(x) < STS(Ti), or if there is no entry of x with commit timestamp CTSOIL(x) in the object invalidation list (OIL) such that STS(Ti) ≤ CTSOIL(x). If a consistent version of x cannot be located in the air-cache, the client proceeds with Option 1(d). Otherwise, it reads the installed version of object x and adds x to ReadSet(Ti).

(d) Requested version of object x is neither in the local cache nor in the air-cache.

Send a fetch request for object x along with STS(Ti) to the server. The server processes the client request as described below. As a reply, the client either receives a transaction-consistent copy of a page p which contains the requested object x or, alternatively, a transaction-consistent version of x. If the request cannot be satisfied, the server notifies the client and Ti must be aborted.

2. Concurrency Control Report Processing on Client C

CCRs are disseminated at the beginning of each MIBC. The client processes the CCR as follows: For each object x included in the write set of a read-write transaction Tj that committed in the last MIBC, an entry is added to the OIL containing the identifier of object x along with its commit timestamp. Additionally, the contents of the page and object cache are refreshed. If object x kept in page p at client C was updated during the last MIBC, the old version of x is overwritten by the newly created version. Otherwise, the updated version of x is installed into the object cache, if x belongs to C’s hot-spot objects. If a prior version of object x becomes useless for Ti, it is discarded from the object cache.

3. Transaction Commit

Transaction Ti is allowed to commit if all read requests were satisfied and no abort notification was sent by the server.
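The client-side checks of steps 1(b) and 2 above can be condensed into a few lines. The sketch below (Python) models the OIL as a mapping from object identifiers to their latest invalidation timestamps, which suffices for the “no entry with STS(Ti) ≤ CTSOIL(x)” test; all container shapes are assumptions of this sketch, not part of the protocol specification:

```python
def can_read_from_page(cts_p, sts_ti, oil, x):
    """Step 1(b): Ti may read x from cached page p if the whole page
    predates Ti's start, or x was not invalidated at/after Ti's start."""
    return cts_p < sts_ti or not (x in oil and sts_ti <= oil[x])

def can_read_from_object_cache(cts_xj, sts_ti):
    """Step 1(b), object cache: a cached version xj is readable iff it
    committed before Ti started, i.e., CTS(xj) < STS(Ti)."""
    return cts_xj < sts_ti

def process_ccr(ccr_entries, oil, page_cache, object_cache, hot_spot):
    """Step 2: record invalidations and refresh both cache segments.

    ccr_entries -- iterable of (tid, commit_ts, write_set, new_values)
    page_cache  -- object id -> value, for page-resident objects
    object_cache-- object id -> value, client object cache
    hot_spot    -- set of object ids the client treats as hot-spot data
    """
    for tid, commit_ts, write_set, new_values in ccr_entries:
        for x in write_set:
            oil[x] = commit_ts                   # note the invalidation
            if x in page_cache:                  # overwrite stale copy
                page_cache[x] = new_values[x]
            elif x in hot_spot:                  # install hot-spot object
                object_cache[x] = new_values[x]
```

Garbage collection of versions that became useless for the running transaction is omitted here for brevity.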

Server Operations

1. Fetch Request for Object x From Client C

If the server receives a fetch request for object x from transaction Ti, it first checks whether the page p holding x is in the server cache. If p is cache-resident and the startup timestamp of Ti is equal to the number of the current MIBC, the server sends page p to C after applying all pending MOB entries to p. Otherwise, the server searches for x in the MOB. If it finds an entry for object x such that CTS(x) < STS(Ti), the server sends object x to the client. Otherwise, if p is not cache-resident, but STS(Ti) equals the number of the current MIBC, the server reads p from disk, applies to the page all pending modifications recorded in the MOB for objects that reside in p, and sends p to the client. If none of the above conditions holds, the fetch request cannot be satisfied for consistency reasons and an abort message is transmitted to the client.

2. Integration of the TOB into the MOB

At the end of each MIBC, the newly created and updated versions of objects are merged into the MOB. If objects already exist in the MOB, their values are overwritten and their timestamps updated.

3. Filling the broadcast disk server

The server fills the memory storage space allocated to contain the data and index segments

of the broadcast program at the beginning of each MBC. In doing so, the server proceeds

as follows: If the desired page p containing hot-spot objects is not in the page cache of

the server, it is read into the cache from the disk and thereafter, it is updated to reflect all the

modifications of its objects recorded in the MOB. At the end of this process, every data page p

included in the broadcast program is completely up-to-date, i.e., it contains the most current

versions of its objects. Further, the server creates a (1,m) index containing entries of objects

scheduled for broadcasting and stores it into all index segments of the broadcast program.

The concurrency control segment of the broadcast program is updated at the beginning of

each MIBC just before its broadcast. This segment is filled with the CCR as described above.
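Server operation 1 above is essentially a three-way decision. A sketch under assumed data structures (a MOB keyed by OID holding (commit timestamp, value) pairs, a page cache keyed by page id, and a page table `page_of`) could look like this:

```python
# Hypothetical sketch of the server's fetch-request handling (operation 1
# above). mob maps OID -> (commit_ts, value); page_cache maps page id ->
# {OID: value}; page_of maps OID -> page id.

def handle_fetch(oid, sts_ti, current_mibc, page_cache, mob, page_of,
                 read_page_from_disk):
    """Return ('page', p), ('object', value), or ('abort', None)."""

    def apply_mob(page, pid):
        # Install all pending MOB modifications of objects residing in pid.
        for o, (_, val) in mob.items():
            if page_of[o] == pid:
                page[o] = val
        return page

    pid = page_of[oid]
    if pid in page_cache and sts_ti == current_mibc:
        return ('page', apply_mob(dict(page_cache[pid]), pid))
    if oid in mob and mob[oid][0] < sts_ti:        # CTS(x) < STS(Ti)
        return ('object', mob[oid][1])
    if pid not in page_cache and sts_ti == current_mibc:
        return ('page', apply_mob(read_page_from_disk(pid), pid))
    return ('abort', None)    # request cannot be satisfied consistently
```

The final `('abort', None)` branch corresponds to the consistency-motivated abort message described above.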

4.4.2 Multi-Version Concurrency Control Protocol with Strict Forward BOT Serializability Guarantees (MVCC-SFBS)

Having described an MVCC protocol that ensures BOT Serializability, we now extend this scheme to provide Strict Forward BOT Serializability. Recall that the Strict Forward BOT Serializability level differs from the BOT Serializability degree by requiring that read-only transactions observe the updates of read-write transactions that committed after their respective starting points, provided that Read Rule 1 is satisfied. To implement the latter requirement, we adopt a technique used by the multi-versioning with invalidation scheme in [123] and associate a read forward flag or RFF with each read-only transaction Ti in MVH. The RFF is initially (i.e., at transaction start time) set to true and indicates whether Ti has read a version of an object that was later modified by a read-write transaction Tj. If such an event occurs, the RFF of Ti is set to false and Tj's commit timestamp is recorded in a variable called read forward stop timestamp or RFSTS. Equipped with the latter information, the scheduler can efficiently determine which version of a requested object x a read-only transaction Ti needs to observe by applying the following algorithm:

begin
    if read-only transaction Ti requests to read object x and RFF is set to false then
        Ti reads the latest committed object version of x with a commit timestamp
        CTS(x) that is smaller than RFSTS(Ti)
    else
        Ti reads the most recent object version of x
end

Algorithm 4.1: Algorithm used by the MVCC-SFBS scheduler to map an appropriate object version to some read request for object x issued by read-only transaction Ti.
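Algorithm 4.1 can be rendered in a few executable lines; the list-of-(CTS, value) version store assumed here is our own simplification, not the dissertation's storage layout:

```python
# Executable rendering of Algorithm 4.1. versions[oid] is assumed to be a
# list of (commit_ts, value) pairs in ascending commit-timestamp order.

def read_sfbs(versions, oid, rff, rfsts):
    chain = versions[oid]
    if not rff:
        # Ti must not observe versions committed at or after RFSTS(Ti).
        return max((c, v) for c, v in chain if c < rfsts)[1]
    return chain[-1][1]     # RFF still true: read the most recent version
```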

4.4.3 Multi-Version Concurrency Control Protocol with Strict Forward BOT Update Serializability Guarantees (MVCC-SFBUS)

Remember that the Update Serializability IL is less restrictive than the Full Serializability IL in that it allows different read-only transactions to see different serialization orders of read-write transactions. This weaker requirement affects the forward read behavior of read-only transactions running under Strict Forward BOT Update Serializability. As for the MVCC-SFBS protocol, the mobile client has to determine for each active read-only transaction Ti whether it needs to observe the effects of a read-write transaction Tj that committed during Ti's execution time. To this end, each read-only transaction maintains two additional data structures: (a) First, an object version write prohibition list or OVWPL is associated with each read-only transaction Ti. An OVWPL is a set of pairs (OID, CTS), where OID denotes the identifier of an object whose value Ti is not allowed to see and CTS represents the logical time when the transaction that modified or created the object committed. The OVWPL of an active read-only transaction Ti is updated whenever a new CCR appears on the broadcast channel. (b) Additionally, for each active read-only transaction Ti the client maintains an object version read prohibition list or OVRPL that keeps track of the objects read by read-write transactions that committed during Ti's execution time and whose effects may not be seen by Ti. The identifiers of objects created by a read-write transaction Tj along with the corresponding commit timestamp (in case of the OVWPL structure) have to be added to Ti's OVRPL and OVWPL, if any of the following conditions holds:2

1. ReadSet(Ti) ∩ WriteSet(Tj) ≠ ∅

2. OVWPL(Ti) ∩ ReadSet(Tj) ≠ ∅

3. OVWPL(Ti) ∩ WriteSet(Tj) ≠ ∅

4. OVRPL(Ti) ∩ WriteSet(Tj) ≠ ∅

Condition 1 implies that in order for Ti to read "forward" on objects written by Tj, the intersection between Ti's read set and Tj's write set must be empty. Condition 2 states that Tj may not have seen any objects contained in Ti's OVWPL. It ensures that Ti will only see the effects of Tj if there is no wr-dependency (Tk δ^wr Tj) between any read-write transaction Tk whose updates are registered in Ti's OVWPL and Tj. Condition 3 states that Tj may not have overwritten an object whose corresponding object identifier is contained in Ti's OVWPL. This rule ensures that Ti will only see the effects of Tj if there is no read-write transaction Tk whose updates are registered in Ti's OVWPL and on which Tj ww-depends. Condition 4 states that Tj must not have overwritten an object that is included in Ti's OVRPL. This condition guarantees that Ti will only see the updates of Tj if there exists no rw-dependency (Tk δ^rw Tj) between any read-write transaction Tk (that conflicts with Ti and whose read operations are included in Ti's OVRPL) and Tj. A read-only transaction running under Strict Forward BOT Update Serializability sees a correct state of the database, if the transaction scheduler running on the client uses the following algorithm in order to map object requests to appropriate object versions:

2 Note that whenever Tj has updated an object x that is already listed in Ti's OVWPL, the entry of x is not modified and the protocol proceeds with the next object written by Tj (if there is any).

begin
    if read-only transaction Ti requests to read object x and x is registered in Ti's OVWPL then
        Ti reads the latest committed object version of x with a commit timestamp
        CTS(x) that is smaller than the commit timestamp of the entry of object x in OVWPL
    else
        Ti reads the most recent object version of x
end

Algorithm 4.2: Algorithm used by the MVCC-SFBUS scheduler to select an appropriate object version whenever read-only transaction Ti wants to read an object x.
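The four conditions and Algorithm 4.2 together might be sketched as follows (set-valued read/write sets and a dict-valued OVWPL are assumptions of this sketch, not the dissertation's data structures):

```python
# Hypothetical sketch of MVCC-SFBUS. If any of the four conditions holds for
# a committing read-write transaction Tj, Ti may not read "forward" on Tj's
# writes, and Tj's write set is registered in Ti's OVWPL.

def must_prohibit(read_set_ti, ovwpl_ti, ovrpl_ti, read_set_tj, write_set_tj):
    """ovwpl_ti maps OID -> commit timestamp; the other arguments are sets."""
    ovwpl_ids = set(ovwpl_ti)
    return bool(
        read_set_ti & write_set_tj     # 1. Tj overwrote an object Ti read
        or ovwpl_ids & read_set_tj     # 2. Tj read a prohibited version
        or ovwpl_ids & write_set_tj    # 3. Tj overwrote a prohibited object
        or ovrpl_ti & write_set_tj     # 4. rw-dependency via Ti's OVRPL
    )

def read_sfbus(versions, oid, ovwpl):
    """Algorithm 4.2: versions[oid] is a CTS-sorted list of (cts, value)."""
    chain = versions[oid]
    if oid in ovwpl:
        # Read the latest version older than the prohibited one.
        return max((c, v) for c, v in chain if c < ovwpl[oid])[1]
    return chain[-1][1]                # otherwise: most recent version
```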

4.4.4 Multi-Version Concurrency Control Protocol with Strict Forward BOT View Consistency Guarantees (MVCC-SFBVC)

View Consistency is the weakest IL that provides transaction consistency to read-only transactions and has the potential to maximize the number of forward reads of read-only transactions without violating transaction correctness. To determine whether an active read-only transaction Ti is required to see the updates of a read-write transaction Tj that successfully finished its execution during Ti's lifetime, the satisfaction of the following conditions has to be tested:

1. ReadSet(Ti) ∩ WriteSet(Tj) ≠ ∅

2. OVWPL(Ti) ∩ ReadSet(Tj) ≠ ∅

3. OVWPL(Ti) ∩ WriteSet(Tj) ≠ ∅

If any of those conditions holds, the write set of Tj must be registered in Ti's OVWPL and Ti is not allowed to read "forward" and observe the effects of Tj. Otherwise, Ti is allowed to read an object x written by Tj provided that there exists no later version of the respective object that Ti is allowed to observe as well. In order to decide which version of a requested object x a read-only transaction Ti needs to observe, MVCC-SFBVC uses the same algorithm as the MVCC-SFBUS protocol (see Algorithm 4.2). As the conditions that decide whether a read-only transaction Ti is allowed to read "forward" on some read-write transaction Tj are a proper subset of the ones formulated for the MVCC-SFBUS scheme, the MVCC-SFBVC protocol provides strictly stronger currency guarantees than MVCC-SFBUS. Further, it is easy to see that MVCC-SFBVC has lower time and space overheads than MVCC-SFBUS since the former does not need to maintain the OVRPL data structure. Hence, we expect that the MVCC-SFBVC scheme outperforms the MVCC-SFBUS protocol in our performance study.

4.5 Performance Results

The performance study aims at measuring the absolute and relative performance of our proposed

MVCC protocols in a wireless hybrid data delivery environment. Additionally, we compare our protocols with previously devised concurrency control schemes [123, 139] to detect performance trade-offs among these different schemes and their underlying consistency and currency guarantees.

We analyze the performance of the new ILs’ implementations and other protocols using two key metrics, namely transaction commit rate and transaction abort rate. We restricted the subsequent analysis to those two performance metrics since they help us to evaluate the protocols’ performance and overhead in the most condensed form. For an extended version of the performance analysis, we refer the interested reader to [137].

4.5.1 System Model

Our simulation parameters are similar to the ones taken in previous performance studies in the

field of mobile data broadcasting and conventional distributed database systems [4, 7, 58]. The simulation model consists of the following core components: (a) a central broadcast server, (b) numerous mobile clients, (c) a database hosted by the broadcast server, (d) a broadcast program, and (e) a hybrid network allowing the server to establish point-to-point and point-to-multipoint conversations with mobile clients. The components are briefly described below.

Broadcast Server and Mobile Clients:

The broadcast server and the mobile clients are at the heart of the simulator and both are modeled as simple facilities where events are generated and handled according to pre-defined rules.

Various selected events are charged in terms of CPU instructions and are then converted into time using the server's and clients' processor speeds, which are specified in million instructions per second or MIPS. Events on the processors, disks, network interfaces, etc., are executed in a FIFO fashion after a specified delay using an event queue, i.e., if any of these devices is heavily utilized, an event may be scheduled later than its specified time. Time is measured in broadcast ticks, where one tick is defined as the time it takes to broadcast a disk page of 4,096 bytes in size.

The clients’ CPU speed is chosen to be 100 MIPS and the server’s CPU speed is set to 1,200

MIPS. These values reflect typical processor speeds of mobile PDAs and high-performance workstations observed in production systems about three years ago when this study was conducted. Note, however, that despite the fact that the CPU speeds used are somewhat outdated, we believe that our simulation results still approximate very well the relative performance differences between the investigated CC protocols (when deployed in a hybrid data delivery system using today's hardware technology), since the actual sizes of system components are less important than their relative sizes when mirroring the characteristics of real systems. The performance improvements of mobile and stationary computing devices have kept pace with each other over the recent past. As mentioned above, we have associated CPU instruction costs with various system and user events, which are itemized in Table 4.2. Note that the client is charged an inter- and intra-transaction think time between two consecutive transactions and transaction operations, respectively, to simulate the user's decision-making process before proceeding to the next transaction or transaction operation.

The client cache size is chosen to be 2% of the database size. As in previous simulation studies that examined various performance issues within the context of stationary distributed database systems, we modeled the client cache as a hybrid cache by dividing it into a page-based segment and an object-based segment as described before. The client page cache is managed by the LRU replacement policy and the client object cache is organized by an eviction algorithm called P [7].

P is an offline page replacement algorithm that uses knowledge of the objects' access probabilities to determine the cache replacement victims. Whenever the cache capacity is reached, we use this knowledge to evict those objects that have the lowest access probability. Client page and object cache freshness is achieved by downloading recently modified objects from the CCR segment of the broadcast cycle. The server cache size is set to 20% of the database size and it is also divided into a relatively small page cache and a relatively large object cache (see Table 4.2 for the split ratio between both caches). The server object cache is further split into a modified object cache

(MOB) and a temporary object cache (TOB), with the latter not being explicitly modeled as a separate facility with its own dedicated storage space and cache replacement policy since we expect its storage requirements to be negligible. The MOB is modeled as a single-version object cache as described in [53] and is managed as a simple FIFO buffer; the server page cache, on the other hand, is managed by an LRU replacement policy.
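A minimal sketch of the P eviction policy as described above, assuming perfect (offline) knowledge of the objects' access probabilities; the function name is ours:

```python
# Hypothetical sketch of the P eviction policy: when the object cache is
# full, evict the cached object with the lowest known access probability.

def p_insert(cache, capacity, access_prob, oid, value):
    """cache: dict OID -> value; access_prob: dict OID -> probability."""
    if oid not in cache and len(cache) >= capacity:
        victim = min(cache, key=lambda o: access_prob[o])
        del cache[victim]          # evict the coldest cached object
    cache[oid] = value
```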

Database and Broadcast Program:

A relatively small database size is used in order to make the simulations of our complex mobile broadcasting architecture computationally feasible with today’s computer technology. Therefore, the database is modeled as a set of 10,000 objects. The size of each disk page is 4 KB and a page contains 40 objects of 100 bytes each. The database objects are stored on 4 disks on the server and page fetch requests are uniformly assigned to a disk independent of the workload. Note that disks are only available at the server, i.e., we assume diskless mobile clients. The disk itself is modeled as a shared FIFO queue on which operations are scheduled in the order they are initiated.

Disk delays are composed of a queuing delay and a disk access and transfer delay, where the disk access delay is the sum of seek time (defined as the time it takes for the disk head to reach the data track) and rotational latency (defined as the time it takes for the data sector to rotate under the disk head). We use an average seek time of 4.5 ms, an average rotational latency of 3.0 ms, and a disk bandwidth of 40 Mbps. These performance values correspond to the Quantum Atlas

10K III Ultra SCSI disk [107].

Server Database Parameters
  Database size (DBSize)                          10,000 objects
  Object size (OBSize)                            100 bytes
  Page size (PGSize)                              4,096 bytes

Server Cache Parameters
  Server buffer size (SBSize)                     20% of DBSize
  Page buffer memory size                         20% of SBSize
  Object buffer memory size                       80% of SBSize
  Page cache replacement policy                   LRU
  Object cache replacement policy (MOB)           FIFO

Server Disk Parameters
  Fixed disk setup costs                          5,000 instr
  Rotational speed                                10,000 RPM
  Media transfer rate                             40 Mbps
  Average seek time (read operation)              4.5 ms
  Average rotational latency                      3.0 ms
  Variable network costs                          7 instr/byte
  Page fetch time                                 7.6 ms
  Disk array size                                 4

Client/Server CPU Parameters
  Client CPU speed                                100 MIPS
  Server CPU speed                                1,200 MIPS
  Client/Server page/object cache lookup costs    300 instr
  Client/Server page/object read costs            5,000 instr
  Register/Unregister a page/object copy          300 instr
  Register an object in prohibition list          300 instr
  Prohibition list lookup costs                   300 instr
  Inter-transaction think time                    50,000 instr
  Intra-transaction think time                    5,000 instr

Table 4.2: Summary of the system parameter settings – I.

The broadcast program determines the structure and the contents of the underlying broadcast disk. We assume a flat single-version broadcast disk whose contents are cyclically disseminated along with the (1,m) index and the CCR to the client population. Since we want to model a hybrid data delivery environment, only the hottest 20% of the database is broadcast.
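As a sanity check of the disk delay model, the access delay (average seek plus rotational latency) and the transfer time of one 4,096-byte page at 40 Mbps add up to roughly 8.3 ms, in the same range as the 7.6 ms page fetch time listed in Table 4.2:

```python
# Access + transfer delay for one disk page under the parameters above
# (queuing delay excluded; seek time and rotational latency are averages).

def disk_delay_ms(seek_ms=4.5, rot_ms=3.0, page_bytes=4096, bw_mbps=40):
    transfer_ms = page_bytes * 8 / (bw_mbps * 1e6) * 1e3
    return seek_ms + rot_ms + transfer_ms

print(round(disk_delay_ms(), 2))   # -> 8.32
```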

At the beginning of each MBC, a dedicated broadcast disk server is filled with data pages containing the most popular 20% of data objects. Each MBC is subdivided into five MIBCs and each MIBC, in turn, consists of one-fifth of the data to be broadcast within an MBC, a (1,m) index to make the data self-descriptive, and a CCR as described before.

Hybrid Network:

Our modeled network infrastructure consists of three communication paths: (a) a broadcast channel, (b) back-channels or uplink channels from the clients to the server, and (c) downlink channels from the server to the clients. The network parameters of those communication paths are modeled after a real system such as Hughes Network System's DirecPC3 [70]. We set the default broadcast bandwidth to 12 Mbps and the point-to-point bandwidth to 400 Kbps downstream and to 19.2 Kbps upstream. Both point-to-point connections were modeled as unshared resources and, in order to create network contention, we restricted the number of uplink and downlink communication channels to two in each direction. With respect to point-to-point communication costs, each network message has a latency that is divided into four components: (a) CPU costs for sending the message, (b) queuing delay, (c) transmission time, and (d) CPU costs for receiving the message.

CPU costs at each communication end consist of a fixed number of instructions (i.e., 6,000 instructions) and a variable number of instructions that are charged according to the size of the message (i.e., 7.6 instructions/byte). The uplink and downlink network paths are modeled as a shared FIFO queue on which operations are scheduled in the order they are initiated. The message size divided by the bandwidth of the uplink or downlink channel determines how long the network queue is occupied for sending the message. Communication costs incurred by transmitting broadcast messages to the client population do not include queuing delays4 since we assume that there is no congestion in the broadcast medium. Finally, the network parameters used in the simulation study are once more summarized in Table 4.3. Note that we used an end-to-end latency for the transmission of request messages (from the client to the server and back again) of only

20 ms. Although the DirecPC system has much higher message latency in reality (approximately

375 ms) [64], underestimating the propagation delay helped us speed up the simulation runs.
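The four latency components might be sketched as follows (the function name is ours; the queuing delay is passed in rather than modeled, and the instruction counts follow Table 4.2):

```python
# Point-to-point message latency = sender CPU + queuing + transmission +
# receiver CPU, with CPU costs of 6,000 fixed + 7 instructions per byte.

def message_latency_ms(size_bytes, bw_bps, sender_mips, receiver_mips,
                       fixed_instr=6000, var_instr_per_byte=7,
                       queuing_ms=0.0):
    def cpu_ms(mips):
        instr = fixed_instr + var_instr_per_byte * size_bytes
        return instr / (mips * 1e6) * 1e3
    transmission_ms = size_bytes * 8 / bw_bps * 1e3
    return (cpu_ms(sender_mips) + queuing_ms + transmission_ms
            + cpu_ms(receiver_mips))
```

For a 100-byte uplink request sent by a 100 MIPS client to the 1,200 MIPS server over the 19.2 Kbps channel, the roughly 41.7 ms transmission time dominates the total latency.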

3 DirecPC is now being called DIRECWAY [71].
4 All other previously mentioned communication costs incurred by disseminating data are, however, charged to the server and clients.

Client Cache Parameters
  Client cache size (CCSize)                          2% of DBSize
  Client object cache size                            80% of CCSize
  Client page cache size                              20% of CCSize
  Page cache replacement policy                       LRU
  Object cache replacement policy                     P

Broadcast Program Parameters
  Number of broadcast disks                           1
  Number of objects disseminated per MBC              20% of DBSize
  Number of index segments per MBC                    5
  Number of CCRs per MBC                              5
  Bucket size                                         4,096 bytes
  Bucket header size                                  96 bytes
  Index header size                                   96 bytes
  Index record size                                   12 bytes
  OID size                                            8 bytes

Network Parameters
  Broadcast bandwidth                                 12 Mbps
  Downlink bandwidth                                  400 Kbps
  Uplink bandwidth                                    19.2 Kbps
  Fixed network costs                                 6,000 instr
  Variable network costs                              7 instr/byte
  Propagation and queuing delay                       20 ms
  Number of point-to-point uplink/downlink channels   2

Table 4.3: Summary of the system parameter settings – II.

4.5.2 Workload Model

Transaction processing is modeled in our simulation study as in a stock trading and monitoring database. Data objects are modified at the server by a data workload generator that simulates the effects of multiple read-write transactions. To produce data contention in the system, 100 data objects are modified by as many as 20 transactions during the course of an MBC and their commit points are uniformly distributed over this period of time. There is no variance in the number of objects written by any of the initiated read-write transactions, i.e., each transaction updates exactly 5 data objects. Objects read and written by read-write transactions are modeled by using a

Zipf distribution [168] with parameter θ = 0.80 and each object of the database is accessible and updateable by the read-write transactions. The ratio of the number of write operations to the number of read operations is fixed at 0.25, i.e., only every fifth operation issued by the server results in an object modification. Read-only transactions are modeled as a sequence of 10 to 50 read operations. As for the read and write operations at the server, the access probability of client read operations follows a Zipf distribution with parameter θ = 0.80 (and θ = 0.95, respectively), i.e., about 75% (90%) of all object accesses are directed to 25% (10%) of the database. To account for the impact on the communication and server resources when the client sends a data request to the server, we model a multi-client environment consisting of 10 mobile clients. For simplicity and as noted before, we assume that each mobile client runs only one read-only transaction at a time. We adopt a parameter, called uplink usage threshold [6], whose value determines whether a client may explicitly request an object even though it is scheduled for broadcasting. The chosen threshold of 100% means that an object version cannot be requested from the server if it is listed in the broadcast program. We have chosen an abort variance of 100%, which means that whenever a read-only transaction aborts due to a read conflict, the restarted transaction reads from a different set of objects. To conclude the description of the modeled hybrid data delivery system, Table 4.4 summarizes the simulator's workload parameter settings and Figure 4.4 shows the interrelationship between the various system components.
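The Zipf(θ) access pattern used by both the server update workload and the client read workload can be sketched as follows (the function name is ours); object i, with the hottest object first, is drawn with probability proportional to (1/i)^θ, so the smaller θ, the flatter the distribution:

```python
# Zipf(theta) access probabilities over n objects, hottest object first.

def zipf_probs(n, theta):
    weights = [1.0 / (i ** theta) for i in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]
```

With θ = 0.80 the hottest quarter of the objects attracts the bulk of the probability mass, which is the skew the workload description above relies on.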

Workload Parameters
  Read-write transaction size (number of operations)       25
  Read-only transaction size (number of operations)        10 – 50
  Server data update pattern (Zipf distribution with θ)    0.80
  Client data access pattern (Zipf distribution with θ)    [0.80, 0.95]
  Number of updates per MBC                                1.0% of DBSize
  Number of concurrent read-only transactions per client   1
  Uplink usage threshold                                   100%
  Abort variance                                           100%

Table 4.4: Summary of the workload parameter settings.

(Figure 4.4 depicts the simulation model: the server's broadcast manager disseminates a major broadcast cycle consisting of five minor broadcast cycles — each made up of a data segment, an index, and a CCR — over the broadcast channel (air-cache) to Clients 1 through 10. Each client runs a transaction generator against its page and object caches; cache misses are forwarded by the client's pull manager to the server's pull manager, which is backed by the server's caches and the database.)

Figure 4.4: An overview of the simulation model used to generate the performance statistics.

4.5.3 Experimental Results of the Proposed CC Protocols

All performance results presented are derived from executing 10,000 read-only transactions after the system reached its steady state. The results come from artificially generated traces, i.e., they give only an intuition about the performance of our IL implementations, but may not represent a real application's behavior. We now give a brief interpretation of the experimentally measured results w.r.t. the aforementioned metrics.

As Figures 4.5(a) and 4.5(b) show, the transaction throughput of all protocols decreases along the x-axis as the number of objects accessed by read-only transactions rises. Increasing the transaction length results in longer transaction execution times and hence fewer transaction commits per second. Furthermore, longer read-only transactions might abort at a later point of their execution, which results in higher abort costs, thus also reducing the transaction throughput. Additionally, as transaction execution time progresses, the likelihood that object read requests can be satisfied by some component of the database system (client cache, air-cache, or server memory) decreases. Thus, apart from increased abort costs, higher abort rates are another consequence of longer transaction execution times. As the tabular results show, the performance difference between MVCC-SFBVC and the other protocols widens slightly (w.r.t. the MVCC-SFBS and MVCC-SFBUS schemes) and significantly (w.r.t. the MVCC-BS protocol) with increasing read-only transaction length. The growing performance penalty is caused by a disproportionate increase in the number of messages sent per committed read-only transaction since fewer client cache and air-cache hits occur.

The increase in the transaction abort rate as a function of the transaction length is depicted in

Figures 4.6(a) and 4.6(b). In terms of the abort rate, the relative difference between the protocols decreases as the transaction size grows from 10 to 50 read operations. The reason for the narrowing gap between the protocols is a decline in the relative difference in the number of prohibition list entries or PLEs (defined as the number of data objects that a read-only transaction is forbidden to read "forward" on by its commit point) with increasing transaction length.

(a) Client data access pattern (θ = 0.80): throughput in transaction commits per second, plotted against transaction lengths of 10 to 50. The embedded table gives the throughput penalty relative to MVCC-SFBVC:

  Transaction Length   MVCC-SFBUS   MVCC-SFBS   MVCC-BS
  10                   0.5%         0.8%        55.0%
  20                   2.4%         1.6%        93.4%
  30                   4.2%         6.9%        99.6%
  40                   7.4%         7.7%        -
  50                   9.0%         9.0%        -

(b) Client data access pattern (θ = 0.95):

  Transaction Length   MVCC-SFBUS   MVCC-SFBS   MVCC-BS
  10                   0.4%         1.2%        29.7%
  20                   2.1%         5.0%        70.8%
  30                   3.6%         5.2%        91.6%
  40                   4.1%         8.1%        98.0%
  50                   4.5%         10.2%       99.8%

Figure 4.5: Throughput (transaction commits per second) achieved by MVCC-BS, MVCC-SFBS, MVCC-SFBUS, and MVCC-SFBVC under the baseline setting of the simulator. While the graphs show absolute simulation values, the tables present the performance penalty of the three ILs relative to the best performing CC protocol, namely MVCC-SFBVC.

(a) Client data access pattern (θ = 0.80): aborts per second, plotted against transaction lengths of 10 to 50. The embedded table gives the abort-rate penalty relative to MVCC-SFBVC:

  Transaction Length   MVCC-SFBUS   MVCC-SFBS   MVCC-BS
  10                   12.9%        17.9%       94.5%
  20                   5.9%         8.9%        84.9%
  30                   4.6%         7.0%        76.2%
  40                   3.4%         6.0%        70.9%
  50                   2.3%         4.5%        68.5%

(b) Client data access pattern (θ = 0.95):

  Transaction Length   MVCC-SFBUS   MVCC-SFBS   MVCC-BS
  10                   6.3%         17.9%       91.6%
  20                   5.8%         14.9%       82.7%
  30                   3.5%         6.6%        72.9%
  40                   2.4%         5.0%        66.0%
  50                   2.4%         4.3%        62.1%

Figure 4.6: Wasted work (aborts per second) performed by MVCC-BS, MVCC-SFBS, MVCC-SFBUS, and MVCC-SFBVC under the baseline setting of the simulator. While the graphs again show absolute simulation values, the tables present the performance penalty of the three ILs relative to the best performing CC protocol, namely MVCC-SFBVC.

4.5.4 Comparison to Existing CC Protocols

In what follows, we present the results of experiments comparing the transaction throughput and abort rate of the best and worst performing protocols, namely MVCC-SFBVC and MVCC-BS, with

CC schemes previously proposed in the literature which are suitable for mobile database systems [123, 124, 139]. A suite of protocols, namely the multi-versioning method, the multi-versioning with invalidation method, and the invalidation-only scheme, all providing serializability along with varying currency guarantees to read-only transactions, were devised in [123, 124]. Out of those protocols, we selected the invalidation-only (IO) scheme for the comparison analysis. The other two protocols were left out due to their similarity to the MVCC-BS and MVCC-SFBS schemes.

Additionally, we compare our protocols with the APPROX algorithm [139], which provides View

Consistency along with EOT data currency guarantees to read-only transactions. In [139], two implementations, namely F-Matrix and R-Matrix, were developed for the APPROX algorithm. We have selected a variant of the F-Matrix, called F-Matrix-No, for the comparison analysis, since it showed the best performance results among the four protocols, namely Datacycle [67], R-Matrix,

F-Matrix, and F-Matrix-No, experimentally compared in [139]. F-Matrix-No differs from the F-

Matrix protocol by ignoring the cost of broadcasting concurrency control information for each database object, and can therefore be used as a baseline for measuring the best possible performance of the protocol's underlying guarantees.

Figure 4.7 shows the relationship between our proposed protocols (printed in bold and italics) and the ones published in literature (printed in normal type). As the figure indicates, both the IO scheme and F-Matrix-No protocol ensure EOT data currency. Therefore, they are not expected to perform well especially under the 0.95 workload where the clients’ access pattern is highly skewed, thus resulting in frequent conflicts.

We are now ready to present the results of the comparison study. As shown in Figures 4.8 and 4.9, MVCC-SFBVC turns out to be superior to the other compared protocols. On average,

MVCC-SFBVC outperforms the F-Matrix-No by 90.9% for the 0.95 workload and by 85.7% for the 0.80 workload. The average performance degradation of the IO scheme relative to the MVCC-SFBVC protocol is 95.5% for the 0.95 workload and 93% for the 0.80 workload. The reason for the relatively poor performance of the F-Matrix-No and IO schemes is related to the strong data currency requirements these two protocols impose. The F-Matrix-No performs moderately better than the IO scheme since the former processes read-only transactions with weaker consistency guarantees than the latter. While the IO scheme forces read-only transactions to abort whenever they had read some object version that was later updated by some read-write transaction, the constraints imposed by the F-Matrix-No protocol are less severe. Here, only those read-only transactions need to be aborted which had observed an object version that was later updated by some read-write transaction Tj, where Tj belongs to the set of transactions whose effects have been (either directly or indirectly) seen by the respective read-only transaction. Thus, the transactions potentially aborted by the F-Matrix-No protocol form a proper subset of the ones aborted by the IO scheme.

(Figure 4.7 arranges the studied protocols in a grid of data consistency guarantees — Serializability, Update Serializability, and Read Committed & Transaction Consistency — against data currency guarantees — BOT, BOT with read forward obligation, and EOT. The Serializability row holds MVCC-BS, MVCC-SFBS, the multiversioning scheme, the multiversioning with invalidation scheme, and the invalidation-only scheme; MVCC-SFBUS occupies the Update Serializability row; MVCC-SFBVC and F-Matrix-No occupy the Read Committed & Transaction Consistency row.)

Figure 4.7: Protocols studied with their respective data consistency and currency guarantees.

[Figure 4.8: throughput (transactions per second) as a function of transaction length (10 to 50) for MVCC-SFBVC, MVCC-BS, F-Matrix-No, and the IO scheme; panel (a) shows the client data access pattern θ = 0.80, panel (b) θ = 0.95. The embedded tables give the relative performance results with respect to MVCC-SFBVC:]

Access pattern θ = 0.80:
Transaction Length   MVCC-BS   F-Matrix-No   IO scheme
10                   55.0%     35.4%         64.3%
20                   93.4%     86.8%         97.9%
30                   99.6%     97.2%         100%
40                   100%      100%          100%
50                   100%      100%          100%

Access pattern θ = 0.95:
Transaction Length   MVCC-BS   F-Matrix-No   IO scheme
10                   29.7%     53.4%         76.3%
20                   70.8%     96.5%         99.5%
30                   91.6%     99.9%         100%
40                   98.0%     100%          100%
50                   99.8%     100%          100%

Figure 4.8: Absolute and relative throughput results gained by the comparison study under the baseline setting of the simulator. Relative performance results presented in tabular form are relative to the MVCC-SFBVC protocol.

[Figure 4.9: aborts per second as a function of transaction length (10 to 50) for MVCC-SFBVC, MVCC-BS, F-Matrix, and the IO scheme; panel (a) shows the client data access pattern θ = 0.80, panel (b) θ = 0.95, with the same relative performance tables as in Figure 4.8.]

Figure 4.9: Absolute and relative performance penalty of the MVCC-BS, F-Matrix, and IO scheme compared to the MVCC-SFBVC protocol in terms of transaction aborts per second.

4.6 Conclusion and Summary

In this chapter, we have presented formal definitions of four new ILs suitable for managing read-only transactions in mobile broadcasting environments. We have given concrete examples of how the individual ILs differ from each other, provided evidence that their underlying read rules produce correct histories, and identified possible anomalies that may arise when ILs weaker than (Full) Serializability, such as Strict Forward BOT Update Serializability or Strict Forward BOT View Consistency, are used for processing read-only transactions. We have also described a suite of MVCC protocols that efficiently implement the newly defined ILs in a hybrid data delivery environment. Finally, the implementations of our defined ILs were compared by means of a performance study which experimentally confirmed the hypothesis that protocols with weaker correctness requirements outperform implementations of stronger ILs as long as they enforce the same data currency guarantees. The comparison study showed that the MVCC-SFBVC scheme is the best concurrency control mechanism for cacheable transactions executed in mobile broadcasting environments. Thus, MVCC-SFBVC should be the first choice for processing read-only transactions in mobile dissemination-based environments whenever read-only transactions are not required to serialize with the complete set of committed transactions in the system. Otherwise, the MVCC-SFBS protocol is to be preferred.

"The process of scientific discovery is, in effect, a continual flight from wonder."

– Albert Einstein

Chapter 5

Efficient Client Caching and Prefetching Strategies to

Accelerate Read-only Transaction Processing

5.1 Introduction and Motivation

Recall that a mobile hybrid data delivery network is a communication medium that combines push- and pull-based data delivery in an efficient way by broadcasting the data objects that are of interest to a large client population and unicasting less popular data objects only when they are requested by clients. While a combined push/pull data delivery mode has many advantages, it also suffers from two major disadvantages: (a) for data objects fetched from the broadcast channel, the client data access latency depends on the length of the broadcast cycle, and (b) since most data requests can be satisfied either by the clients themselves or by the broadcast channel, the server lacks clear knowledge of the client access patterns. The latter weakness can be diminished by subscribing to the broadcast server and sending usage profiles to it [4, 38, 114], or by dynamically adjusting the broadcast content on the basis of "broadcast miss" information received through direct data requests submitted by clients over point-to-point channels [147]. The former can be relaxed by designing and deploying an efficient cache replacement and prefetching policy that is closely coupled with the transaction manager of the mobile client. Such a tight coupling

of the transaction manager with the client cache replacement and prefetching manager is required since multi-versioning has been identified as a viable approach to diminishing the interference between concurrent read-only and read-write transactions by servicing read requests with obsolete, but nevertheless appropriate, object versions (see Chapter 4). As client cache resources are typically scarce, and to support MVCC efficiently, the client cache manager's task is to maintain in the cache only those object versions that are very likely to be accessed in the future and to evict all those that are unlikely to be referenced and have become useless from the CC point of view. While statistical information on object popularity can be exploited to estimate the likelihood that an object version will be accessed in the future, in-depth scheduling knowledge of the transaction manager is required to determine whether an object version can be safely evicted from the cache because it is no longer needed for servicing "out-of-order" read requests.

Unfortunately, exploiting multi-versioning to improve the degree of concurrency among read-write transactions is not as effective as it is for read-only transactions, since read-write transactions typically need to access up-to-date (or at least "close" to current) object versions to provide serializability guarantees. Therefore, in the remainder of this chapter, we concentrate on multi-versioning as a means of improving the performance of mobile applications issuing read-only transactions exclusively. For a detailed qualitative and quantitative evaluation of the potential of multi-version data caching and prefetching to efficiently satisfy the data requests of mobile read-write transactions, we refer the interested reader to Subsections 6.4.1 and 6.5.5.4 of this thesis.

5.1.1 Multi-Version Client Caching

So far, we have indicated that multi-versioning is a graceful and practicable approach to processing read-only and read-write transactions concurrently. In what follows, we highlight various issues a mobile cache and prefetch manager needs to consider so that key performance metrics are optimized, i.e., transaction throughput is maximized and the transaction abort rate is minimized.

Since data caching is an effective, if not the most effective, and therefore indispensable way of reducing transaction response times [33], cache replacement policies have been extensively studied over the past two decades for stationary database management systems [44, 83, 85, 99, 115].

As conventional caching techniques are inefficient for wireless, broadcast-based networks, where communication channels form an intermediate memory level between the clients and the server and where communication quality varies over space and time, mobile cache management strategies [4, 5, 89, 90, 149, 166] have been designed that are tailored to the peculiarities and constraints of mobile communication systems. However, to our knowledge, none of the caching policies proposed for either the stationary or the mobile client-server architecture tackles the problem of managing multi-version client buffer pools efficiently. Multi-version client caching differs from mono-version caching in at least two key respects: (a) the cost/benefit value of different versions of a data object in the client cache may vary over time depending on the storage behavior of the server, i.e., if the server discards an object version useful for the client, that version's cost/benefit value increases since it can no longer be re-acquired from the server; (b) versions of different data objects may, for the very same reason, have dissimilar cost/benefit values despite being equally likely to be referenced.

The following example illustrates the aforementioned peculiarities. Suppose a diskless mobile client executes a read-only transaction Ti with BOT Serializability consistency (see Section 4.3.2 for its formal definition), i.e., Ti is always forced to observe the most recent object versions that existed as of its starting point. Assume the start timestamp STS of Ti is 1, i.e., the non-decreasing sequence number of the current MIBC is 1, and the database consists of four objects {a, b, c, d}. The client cache is small and may hold only two object versions; further, it is up to the client how many versions of each object it maintains. For space and time efficiency reasons, the database server holds a restricted number of versions, namely the last two committed versions of each data object. Also assume that the client's access pattern is totally uniform, i.e., each object is equally likely to be accessed, and that at the end of MIBC 5 the client cache holds the object versions {a0, b0} while the server keeps the versions {a1, a3, b0, b1, c0, c4, d2, d5}. Note that the subscript assigned to an object version corresponds to the identifier of the MIBC that was current when the transaction that created the respective version committed. Now suppose the client needs to read a transaction-consistent version of object c. Since there is no cache-resident version of object c, the client fetches the missing object from the server.

By the time the object arrives at the client, the local cache replacement policy needs to select a replacement victim to free some cache space. In this case, a judicious cache replacement strategy would evict b0 since it is the only cached object version that can be re-acquired from the server, i.e., a cache replacement policy suitable for a multi-version cache needs to incorporate both probabilistic information on the likelihood of future object references and data re-acquisition costs. To conclude, and to deepen the understanding of the described caching scenario, Figure 5.1 specifies the example's underlying changes to the database state of the server and illustrates their impact on the server and client cache contents as time progresses.

[Figure 5.1: a timeline of MIBCs 0 through 6 showing (i) the newly created object versions per MIBC (a0, b0, c0, d0 at MIBC 0; a1, b1 at 1; d2 at 2; a3 at 3; c4 at 4; d5 at 5; a6, d6 at 6), (ii) the server cache content, which retains the last two committed versions of each object, and (iii) the client cache content, which after misses on a0, b0, and c0 evolves from {a0} to {a0, b0} to {a0, c0}.]

Figure 5.1: An example illustrating the peculiarities of multi-version client caching.
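The victim-selection reasoning of the example can be sketched as follows (a simplified model; the function and variable names are ours and do not belong to the PCC algorithm defined later in this chapter):

```python
def choose_victim(cache, server_versions, access_prob):
    """Pick a replacement victim from a multi-version client cache.

    Versions the server still holds (re-acquirable) are preferred as
    victims; among the candidates, the one with the lowest access
    probability is evicted. Only if no cached version is re-acquirable
    do we fall back to evicting from the whole cache.
    """
    re_cacheable = [v for v in cache if v in server_versions]
    candidates = re_cacheable or list(cache)
    return min(candidates, key=lambda v: access_prob.get(v, 0.0))
```

For the example above, with cache {a0, b0}, server versions {a1, a3, b0, b1, c0, c4, d2, d5}, and a uniform access pattern, the only re-acquirable cached version is b0, which is therefore selected as the victim.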

5.1.2 Multi-Version Client Prefetching

Apart from demand-driven caching and judicious eviction of object versions from the client cache, another technique that can be used to reduce on-demand fetches is data prefetching, by which the client optimistically fetches object versions from the server and/or broadcast channel into the local cache in expectation of a later request. Since prefetching, especially if integrated with caching, strongly affects transaction response time, various combined caching and prefetching techniques have been studied for stationary database systems [32, 83, 119, 151]. Work on prefetching in mobile data broadcasting environments has been conducted by [5, 89, 90, 149]. Again, as for caching, the prefetching mechanisms proposed in the literature are inefficient for mobile dissemination-based applications that utilize MVCC protocols to provide consistency and currency guarantees to read-only transactions. The reasons are twofold: (a) existing algorithms on data prefetching for wireless data dissemination such as PT, its approximation APT [5], Gray [89, 90], and OSLA and its generalization, the W-step look-ahead scheme [149], are based on simplified assumptions such as the absence of database updates and the unavailability of back-channels to the server; (b) even more importantly, all previous prefetching strategies were designed for mono-version database systems and therefore lack the ability to make proper prefetching decisions in a multi-version environment. In contrast, we base our model on more realistic assumptions and develop a prefetching algorithm that is multi-version compliant. As prefetching unfolds its full potential only if deeply integrated with data caching, our prefetching algorithm uses the same cost/benefit metric for evaluating prefetching candidates as the cache replacement algorithm.
To ensure that the prefetching algorithm does not hurt, but rather improves, performance, we allow prefetches only of those object versions that have recently been referenced and whose cost/benefit value exceeds the value of any currently cached object version.

5.1.3 Outline

The chapter is structured as follows: Section 5.2 describes the system model and general design assumptions underlying this research study. In Section 5.3 a new multi-version integrated caching and prefetching policy, called MICP, is introduced along with an implementable approximation of

MICP that we refer to as MICP-L. Section 5.4 reports detailed experimental results that show the superiority of our algorithm over previously proposed online caching and prefetching policies and quantifies the performance penalty of MICP compared to an offline probability-based caching and prefetching algorithm, called P-P, that has full knowledge of the client access behavior. The chapter's conclusions and summary can be found in Section 5.5.

5.2 System Design and General Assumptions

The focus of this chapter is to develop efficient cache management and prefetching algorithms that provide mobile clients with good performance in a dissemination-based environment. In what follows, we present a brief overview of the core components of the system architecture for which MICP is developed and state the general assumptions about the environment for which it is designed.

As the envisaged system architecture does not differ from that discussed in Chapter 4, the reader may skip Sections 5.2.1 and 5.2.2 in sequential reading.

5.2.1 Data Delivery Model

We have chosen a hybrid data delivery system as the underlying network architecture for MICP since a hybrid push/pull scheme has the ability to mask the disadvantages of one data delivery mode by exploiting the advantages of the other. Since broadcasting is especially effective when used for popular data, we assume that the server broadcasts only such data as is of interest to the majority of the client population. Our broadcast structure is logically divided into three segments of varying sizes: (a) an index segment, (b) a data segment, and (c) a CCR segment. Each MIBC is supplemented with an index to eliminate the need for mobile clients to listen continuously to the broadcast in order to locate a desired data object on the channel. We choose (1,m) indexing [78] as the underlying index allocation method, by which the whole index, containing, among other things, a mapping between the objects disseminated and the identifiers of the data pages in which the respective objects appear, is broadcast m times per MBC. The data segment, on the other hand, solely contains hot-spot data pages. Note that we assume a flat broadcast disk approach for page scheduling, i.e., each and every data page is broadcast only once within an MBC. For data consistency reasons, we model the broadcast program so that all data pages disseminated represent a consistent snapshot as of the beginning of each MBC. Thus, object versions modified or newly created after the beginning of an ongoing MBC will not be included in any data page disseminated during that MBC.

To guarantee cache consistency despite server updates, each MIBC is preceded by a concurrency control report, or CCR, as described below.

The second core component of the hybrid data delivery system is the set of point-to-point channels. Point-to-point channels may be utilized by mobile clients to request locally missing, non-scheduled data objects from the broadcast server. Clients are also allowed to use the back-channel to the server when a required data object is scheduled for broadcasting but its expected arrival time lies above the uplink usage threshold [6] dynamically set by the server. This optimization helps clients improve their response times.
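The pull-versus-wait decision can be sketched as follows (a minimal sketch under our own naming; the actual threshold mechanism of [6] is more elaborate and server-tuned):

```python
def fetch_via_uplink(ticks_until_broadcast, uplink_usage_threshold):
    """Decide whether to pull an object over the point-to-point channel.

    The client uses the uplink only when the expected wait for the
    object's next broadcast slot (measured in broadcast ticks) exceeds
    the uplink usage threshold dynamically set by the server; otherwise
    it simply waits for the broadcast.
    """
    return ticks_until_broadcast > uplink_usage_threshold
```

Raising the threshold shifts load away from the server and the back-channel; lowering it improves client response times at the cost of higher server and network utilization.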

5.2.2 Client and Server Cache Management

Conventional caching and prefetching strategies are typically page-based, as the optimal unit of transfer between system resources is a page with a size ranging from 8 KB to 32 KB [55]. In mobile data delivery networks, caching and prefetching data at a coarse granularity such as pages is inefficient due to the physical constraints and characteristics of the mobile environment. For example, communication in the client-server direction is handicapped by low-bandwidth channels. Choosing page-sized granules as the unit of transfer for data uploads would waste bandwidth compared to sending much smaller objects in case of a low degree of locality.

Since a broadcast server typically serves a large client population and each client tends to have its own set of frequently accessed data objects, it is not unrealistic to assume that the physical data organization of the server may not comply with the individual access patterns of the clients. Therefore, in order to increase the hit ratio of the client cache and to save scarce uplink bandwidth resources, we deploy our caching and prefetching schemes primarily on the basis of data objects.

However, to allow clients to cache pages as well, we opt for a hybrid client cache consisting of a small page cache and a large object cache. While the page cache is primarily used as temporary storage from which requested or prefetched object versions are extracted and copied into the object cache, the object cache's task is to maintain those object versions efficiently, i.e., it serves as permanent storage space. Note that our intuition behind such a cache structure was experimentally confirmed by a performance study [42] demonstrating that an object-based caching architecture is superior to a page-based one when physical clustering is poor and the client's cache size is small relative to the size of the database, which is typically the case in mobile environments. The client object cache itself is partitioned into two variable-size segments: (a) the REC (re-cacheable) segment and (b) the NON-REC (non-re-cacheable) segment. As their names imply, the REC segment is used to store object versions that may be re-fetched from the server, while the NON-REC segment is exclusively used to maintain object versions that cannot be re-acquired from the server because they have been evicted from it. To avoid wasting scarce client memory space, the size of both segments is not fixed to some predefined value, but can be dynamically adjusted (within certain bounds) to the needs of the moment (see Section 5.3.1 for more information). Figure 5.2 illustrates the organization of the client cache described above, along with information on how the page and object caches are implemented.

[Figure 5.2: the client cache consists of a page cache, organized as a list with head and tail pointers, and an object cache partitioned into a REC segment and a NON-REC segment, each implemented as a binary min-heap rooted at the next eviction candidate; the delimiter between the two segments is adjustable, with an upper boundary on the NON-REC segment's size.]
Figure 5.2: Organization of the client cache.
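A minimal sketch of such a two-segment, min-heap-based object cache might look as follows (all identifiers are ours; the scores stand in for the cost/benefit values defined later in Section 5.3, and the eviction policy is deliberately simplified):

```python
import heapq

class ObjectCache:
    """Client object cache split into REC and NON-REC min-heaps,
    each keyed by a cost/benefit score (lowest score = first victim)."""

    def __init__(self, capacity, non_rec_limit):
        self.capacity = capacity            # total object slots
        self.non_rec_limit = non_rec_limit  # upper bound on NON-REC size
        self.rec = []                       # (score, version_id) heap
        self.non_rec = []                   # (score, version_id) heap

    def _size(self):
        return len(self.rec) + len(self.non_rec)

    def insert(self, version_id, score, re_cacheable):
        if not re_cacheable and len(self.non_rec) >= self.non_rec_limit:
            heapq.heappop(self.non_rec)     # NON-REC full: drop cheapest
        heap = self.rec if re_cacheable else self.non_rec
        heapq.heappush(heap, (score, version_id))
        while self._size() > self.capacity:
            # Prefer evicting re-cacheable versions; they can be re-fetched.
            victim_heap = self.rec if self.rec else self.non_rec
            heapq.heappop(victim_heap)
```

The min-heaps make the lowest-scored version of each segment available in O(1) time, matching the figure's organization; the dynamic REC/NON-REC boundary of PCC would adjust `non_rec_limit` at run time.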

As for the mobile clients, the broadcast server manages its cache as a hybrid system that permits both page and object caching. To increase cache memory utilization, the object cache is designed to be much larger than the page cache, and the former is partitioned into two segments: (a) a large modified object cache (MOB) and (b) a small temporary object cache (TOB). The structure of the MOB is similar to the one described in [53], with the exception that multiple versions of objects may be maintained to reduce the number of data-miss-induced transaction aborts. The TOB, on the other hand, is used as temporary storage space for uncommitted and recently committed object versions; the latter are merged into the MOB at the end of each MIBC. Again, note that the use of both cache types (page and object caches) allows us to exploit the benefits of each. While the page cache is useful for staging data for the broadcast disk, handling installation reads [118], etc., the object cache is attractive for recording object modifications and servicing object requests.

5.2.2.1 Data Versioning

To be able to distinguish between different versions of the same data object and to correctly synchronize read-only transactions with committed and/or currently active read-write transactions in a mobile multi-version environment, each object version is assigned the identifier of the transaction that wrote the version, i.e., a write operation on an object x by transaction Ti installs object version xi. As we assume that a transaction Ti cannot modify an object x multiple times during its lifetime, this notation identifies each newly created object version unambiguously. Since multi-versioning imposes additional memory and processor overheads on the mobile clients and the server, we assume that the number of versions maintained at the involved memory levels is restricted. For clients it is sufficient to maintain at most two versions of each database object at any time, as we assume that clients do not execute read-only transactions in parallel. In contrast, the server may need to maintain every object version in order to guarantee that any read-only transaction can read from a transaction-consistent database snapshot. Since such an approach is impracticable, we assume that the server maintains a fixed number of versions in the MOB (see Section 5.4.5 for a performance experiment on this issue).

5.2.2.2 Client Cache Synchronization

Hoarding, caching, or replicating data in the client cache is an important mechanism for improving data availability, system scalability, and application response time, and for reducing the power consumption of mobile clients. However, data updates at the server make cache consistency a challenge. An effective cache synchronization and update strategy is needed to ensure consistency and freshness between the primary (source) data at the server and the secondary data cached at the clients. Although invalidation messages are space and time efficient compared to propagation messages, they lack the ability to update the cache with new object versions. Due to the inherent tradeoffs between propagation and invalidation, we employ a hybrid of the two techniques. On the one hand, the broadcast server periodically disseminates a CCR, which is a simple structure that contains, in addition to concurrency control information, copies of all those object versions that have been created during the last MIBC (see Section 4.4 for more information). Based on those reports, mobile clients operating in connected mode can easily update their caches at low cost. However, since CCRs contain concurrency control information only w.r.t. the last MIBC, those reports are useless for cache synchronization of recently reconnected clients that have missed one or more CCRs. To resolve this problem, we assume that the server maintains the update history of the last w MBCs as proposed in [20, 21]. This history is used for client cache invalidation as follows: when a mobile client wakes up from a disconnection, it waits for the next CCR to appear and checks whether the following inequality holds: IDCCR,c < IDCCR,l + w, where IDCCR,c denotes the timestamp of the current CCR and IDCCR,l represents the timestamp of the latest CCR report received by the client. If so, a dedicated invalidation report (IR), containing the identifiers of the data objects that have been modified or newly created during the course of the disconnection, can be requested by the client to invalidate its cache properly. If the client was disconnected for more than w MBCs, the entire cache contents have to be discarded upon reconnection.
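The reconnection logic can be sketched as follows (a simplified model; the identifier names follow the text's notation, while the helper functions are our own invention):

```python
def can_invalidate_selectively(id_ccr_current, id_ccr_last, w):
    """True iff the client missed fewer than w MBCs, i.e. the server's
    update history still covers the disconnection period:
    ID_CCR,c < ID_CCR,l + w."""
    return id_ccr_current < id_ccr_last + w

def on_reconnect(cache, id_ccr_current, id_ccr_last, w, modified_ids):
    """Apply a dedicated invalidation report (IR) if possible; otherwise
    the entire cache contents must be discarded."""
    if can_invalidate_selectively(id_ccr_current, id_ccr_last, w):
        return {k: v for k, v in cache.items() if k not in modified_ids}
    return {}
```

Note the tradeoff that w encodes: a larger history window lets clients survive longer disconnections at the cost of extra server-side storage.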

5.3 MICP: A New Multi-Version Integrated Caching and Prefetching

Algorithm

The design of MICP consists of two complementary algorithms that behave synergistically. The first algorithm, responsible for selecting cache replacement victims, is called PCC (Probabilistic Cost-based Caching); the second one, dealing with data prefetching, is termed PCP (Probabilistic Cost-based Prefetching). While PCC may be employed without PCP in order to save scarce CPU processing and battery power of mobile devices, PCP's potential can be exploited by coupling it with a cache replacement policy that uses the same or a similar metric for decision making.

5.3.1 PCC: A Probabilistic Cost-based Caching Algorithm

The major goal of any cache replacement policy, designed either for broadcasting or for unicasting environments, is to minimize the average response time a user/process experiences when requesting data objects. Traditional cache replacement policies try to achieve this goal using two different approaches: (a) the first category requires information from the database application; that information can be obtained either from the application directly or from the query optimizer that processes the application's queries. (b) The second category of replacement algorithms bases its decisions on observations of past access behavior. The algorithm proposed in this chapter belongs to the latter group; it extends the LRFU policy [98, 99] and borrows from the 2Q algorithm [85]. As with the LRFU policy, PCC quantifies the probability of an object being re-referenced in the future by associating with each object x a score value that incorporates the effects of the frequency and recency of past references to that object. More precisely, PCC computes a combined recency and frequency value, or CRF, for each object x whenever it is referenced by a transaction, according to the following formula:

CRF_{n+1}(x) = 1 + 2^{−λ·(ID_{ref,c} − ID_{ref,l}(x))} · CRF_n(x),    (5.1)

where CRF_n(x) is the combined recency and frequency value of object x over its last n references, ID_{ref,c} denotes the monotonically increasing reference identifier associated with the current object reference, ID_{ref,l}(x) is the reference identifier assigned to object x when it was last accessed, and λ (0 ≤ λ ≤ 1) is a kind of "slide controller" that allows PCC to weigh the relative importance of recency and frequency information for the replacement selection. Note that as λ converges towards 0, PCC behaves more like an LFU policy; conversely, as λ approaches 1, it acts more like an LRU policy.
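Equation 5.1 translates directly into code (the variable names are ours):

```python
def update_crf(crf_prev, id_ref_current, id_ref_last, lam):
    """CRF_{n+1}(x) = 1 + 2^(-lam * (ID_ref,c - ID_ref,l(x))) * CRF_n(x).

    lam -> 0: the decay factor approaches 1, so the score approximates a
    pure reference count (LFU-like behaviour).
    lam -> 1: old score contributions decay quickly, so recency dominates
    (LRU-like behaviour).
    """
    return 1.0 + 2.0 ** (-lam * (id_ref_current - id_ref_last)) * crf_prev
```

With lam = 0 each reference simply increments the score by one; with lam = 1 a score contribution halves with every tick that passes between references.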

In contrast to the LRFU algorithm, PCC bases its replacement decisions not only on recency and frequency information of historical reference patterns, but additionally makes use of three further parameters (besides the future reference probability of objects as expressed by CRF). First, to reflect the fact that the instantaneous access costs of data objects scheduled for broadcasting are non-constant due to the serial nature of the broadcast medium, PCC's replacement decisions are sensitive to the actual state and contents of the broadcast cycle. More precisely, PCC accounts for the costs of re-acquiring object versions by evicting those versions that have low probabilities of access and low re-acquisition costs. To provide a common metric for comparing the costs of ejecting object versions that can be re-cached from the broadcast channel and/or the database server, we measure re-acquisition costs in terms of broadcast units. Since we assume that the content and organization of the broadcast program does not change significantly between consecutive MBCs, and since the clients are aware of the position of each object version in the MBC due to (1,m) indexing, determining the number of broadcast units, i.e., broadcast ticks, until an object version re-appears on the channel is straightforward. However, estimating the cost of re-fetching a requested version from the server is more difficult since that value depends on parameters such as the current load on the communication network and server as well as the effects of server caching. To keep our cache replacement algorithm as simple as possible, we use the uplink usage threshold [6] as a simple guideline for approximating data fetch costs. Since the uplink usage threshold provides a tuning knob to control server and network utilization and thus affects data fetch costs, its dynamically fixed value correlates with the data fetch latency a client experiences when requesting data objects from the server. If the threshold is high, the system is expected to operate under a high workload and, therefore, data retrieval costs are high as well. In what follows, we denote the re-acquisition cost of an object version xi at time t by RCt(xi).

A second parameter that PCC uses to make cache replacement decisions is the update probability of data objects. As noted before, multi-version database systems suffer from high processing and storage overheads if the number of versions maintained by the server for each object is not restricted. However, limiting the number of versions negatively affects the likelihood that data requests from the clients can be satisfied by the server. To provide mobile clients with information on the probability that an object x will be updated during the next MBC, the server needs to estimate that value. It does so by using a well-known exponential aging method. At the end of each MBC, the server (re-)estimates the update probability of any object x that has been modified during the completed MBC by using the following formula:

UP_{n+1}(x) = (1 − α) · UP_n(x) + α / (ID_{MBC,c} − ID_{MBC,l}(x)),    (5.2)

where ID_{MBC,c} is the non-decreasing identifier of the current MBC, ID_{MBC,l}(x) denotes the identifier of the MBC in which object x was last updated, UP_n(x) represents the update probability of object x based on its n previous updates, and α (0 ≤ α ≤ 1) is an aging factor used to adapt to changes in the data update patterns. The higher α, the more important are recent updates of x.

Last but not least, a cache replacement policy that wants to be effective in maintaining multi-version client caches needs to take the server's version storage policy into account. Besides the update probability of each data object, the version maintenance strategy of the server affects the likelihood that an obsolete object version can be re-acquired once evicted from the client cache.
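Equation 5.2 can likewise be transcribed directly (the variable names are ours):

```python
def update_probability(up_prev, id_mbc_current, id_mbc_last, alpha):
    """UP_{n+1}(x) = (1 - alpha) * UP_n(x) + alpha / (ID_MBC,c - ID_MBC,l(x)).

    Exponential aging of x's per-MBC update probability, run by the
    server at the end of each MBC for every object modified during it.
    A larger alpha weights recent update behaviour more heavily; the
    1/(gap) term rewards objects updated after only a short gap.
    """
    return (1.0 - alpha) * up_prev + alpha / (id_mbc_current - id_mbc_last)
```

For example, an object last updated two MBCs ago with a previous estimate of 0.4 and α = 0.5 gets a new estimate of 0.5 · 0.4 + 0.5 / 2 = 0.45.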

The more versions of an object x are kept by the server, the higher the probability that the server can satisfy requests for specific versions of x. PCC incorporates the versioning policy of the server by means of two complementary methods: (a) it computes re-acquisition costs of in-memory object versions based on their re-fetch probabilities (see Equations 5.4 and 5.5) and (b) it takes care of non-re-cacheable object versions by placing them into their own dedicated partition of the client object cache, namely the NON-REC segment. Re-cacheable object versions, on the other hand, are maintained in the REC segment of the client object cache as noted above.

The reason for cache partitioning is to prevent undesirable replacement of non-re-cacheable versions by referenced or prefetched re-cacheable object versions. With regard to the size of the cache partitions, we experimentally established that NON-REC should not exceed 50% of the overall client cache size, i.e., REC should never be smaller than 50% of the available amount of client memory space. The justification for those values is as follows: the majority of users issuing read-only transactions want to observe up-to-date (or at least "close" to current) object versions [41,145,163], i.e., they usually initiate read-only transactions with either BOT or strict forward BOT data currency guarantees (see Chapter 4). The assumptions that clients do not execute more than one read-only transaction at a time and that transactions are issued with at least BOT data currency requirements imply that at their starting points only up-to-date object versions are useful to them, i.e., the NON-REC segment of the client cache is empty at this stage. As transactions progress, more and more useful object versions may become non-re-cacheable and need to be placed into NON-REC. Since the storage space needed to maintain non-re-cacheable object versions is not known in advance and depends on factors such as the transaction size, the user's/transaction's data currency requirements, the rate at which objects are being updated, etc., PCC adapts to this situation by changing the size of NON-REC dynamically. That is, as demand for more storage space in NON-REC arises, PCC dynamically extends the size of the NON-REC segment by re-allocating object slots from REC to NON-REC as long as its size does not exceed 50% of the overall client cache size. Without this upper bound, the system performance could degrade due to insufficient cache space being reserved for up-to-date or nearly up-to-date (re-cacheable) versions. It is important to note that this cache structure suits read-write transactions as well, since they have similar data requirements as read-only transactions, with the exception that potentially fewer non-current object versions are requested for correctness reasons (see Section 6.4.1 for further discussion).
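The dynamic re-allocation of slots from REC to NON-REC, capped at half the cache, can be sketched as follows. This is an illustrative skeleton under our own naming, not the thesis implementation; the eviction policy is passed in as a callback since victim selection (by PCB value) is described separately:

```python
class PartitionedCache:
    """Two-segment client object cache: REC (re-cacheable) and NON-REC (sketch)."""

    def __init__(self, total_slots):
        self.total = total_slots
        self.non_rec_limit = total_slots // 2  # NON-REC hard cap: 50% of the cache
        self.rec = {}      # re-cacheable versions: (oid, version) -> payload
        self.non_rec = {}  # non-re-cacheable versions

    def can_grow_non_rec(self):
        return len(self.non_rec) < self.non_rec_limit

    def insert_non_rec(self, key, obj, evict_rec_victim):
        """Admit a non-re-cacheable version, re-allocating one slot from REC
        when the cache is full and NON-REC is still under its 50% cap."""
        if len(self.rec) + len(self.non_rec) >= self.total:
            if not self.can_grow_non_rec():
                return False            # cap reached: must replace within NON-REC
            evict_rec_victim(self.rec)  # free one REC slot for NON-REC
        self.non_rec[key] = obj
        return True
```

The cap guarantees that at least half of the client memory always remains available for up-to-date or nearly up-to-date versions, matching the experimentally established 50% bound.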

As all of the aforementioned parameters influence replacement decisions, a combined performance metric is needed so that the client cache manager can compare them meaningfully. To this end, we combine the estimates given above into a single performance metric, called the probabilistic cost/benefit value (PCB), which is computed for each cache-resident object version x_i at eviction time t as follows:

PCB_t(x_i) = CRF_t(x) · (T_hit(x_i) + T_miss(x_i)).    (5.3)

In the above formula, CRF_t(x) denotes the re-reference probability of object x at time t, T_hit is the weighted time in broadcast ticks it takes to re-fetch object version x_i if evicted from the cache, and T_miss represents the weighted time required to re-process the completed read operations of the active read-only transaction, denoted T_j in what follows, in case it needs to be aborted because x_i is no longer system-resident and thus cannot be provided to T_j.

The weighted time to service a request for object version xi that hits either the air-cache or the server memory is the product of the following parameters:

T_hit(x_i) = (1 − UP(x)^{N_ver(x_i)}) · RC(x_i),    (5.4)

where N_ver(x_i) denotes the number of versions of object x with CTSs equal to or older than x_i currently being kept, or potentially allowed to be kept, by the server according to its version management policy. Further on, we compute T_miss(x_i) as a weighted approximation of the amount of time it would take the client to restore the current state of T_j, for which x_i is useful, in case T_j has to be aborted due to a fetch miss of x_i:

T_miss(x_i) = UP(x)^{N_ver(x_i)} · T_{re-exe,j},    (5.5)

where T_{re-exe,j} denotes the sum of the estimated retrieval and processing times it would take the client to re-execute T_j's data operations (if aborted) and is computed as follows:

T_{re-exe,j} = C_hit · N_{op,j} + (1 − C_hit) · N_{op,j} · L_MBC / 2,    (5.6)

where L_MBC represents the average length of the MBC, C_hit denotes the average cache hit rate of the client, and N_{op,j} symbolizes the number of read operations executed so far by read-only transaction T_j. As Formula 5.6 indicates, we assume that the average latency to fetch a non-cache-resident object version into the client memory is half a broadcast period, independent of whether that object appears on the broadcast channel or has to be requested through point-to-point communication. We opted for this simplification to keep the algorithm from becoming more complex.
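The cost model of Equations 5.3 to 5.6 reduces to a few arithmetic expressions; a direct Python transcription (function and parameter names are ours) might look like this:

```python
def t_re_exe(c_hit, n_ops, l_mbc):
    """Equation 5.6: weighted re-execution time of transaction T_j."""
    return c_hit * n_ops + (1 - c_hit) * n_ops * l_mbc / 2

def t_hit(up_x, n_ver, rc):
    """Equation 5.4: weighted re-fetch time if x_i can still be re-acquired."""
    return (1 - up_x ** n_ver) * rc

def t_miss(up_x, n_ver, c_hit, n_ops, l_mbc):
    """Equation 5.5: weighted abort/re-execution penalty if x_i is lost."""
    return up_x ** n_ver * t_re_exe(c_hit, n_ops, l_mbc)

def pcb(crf_x, up_x, n_ver, rc, c_hit, n_ops, l_mbc):
    """Equation 5.3: probabilistic cost/benefit value of a cached version x_i."""
    return crf_x * (t_hit(up_x, n_ver, rc)
                    + t_miss(up_x, n_ver, c_hit, n_ops, l_mbc))
```

Note how the two weighted terms trade off: the more versions of x the server keeps (larger N_ver), the smaller UP(x)^{N_ver}, so the re-fetch term dominates; for a version about to fall out of the server's version window, the abort penalty dominates instead.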

The complete PCC algorithm, invoked upon a reference to an object version x_i, is illustrated below. Algorithm 5.1 contains a number of functions/procedures used to modularize the code.

The function segment(x_i) determines the segment of the client cache in which object version x_i is or will be maintained, the procedure select_victim(CS_i) selects and evicts the object version with the lowest PCB value from the cache segment CS_i, and the function retrieval_latency(x_i) returns the estimated time to service a fetch request for object version x_i.

Notations:
    CS_i: Client object cache segment i, i ∈ {REC, NON-REC}.
    T(x_i): Estimated weighted time to service a fetch request for object version x_i.

begin
    if x_i is already cache-resident then
        break
    else
        /* determine replacement victim */
        if there is no free space for x_i in segment(x_i) then
            select_victim(segment(x_i))
        insert x_i into the free or recently freed cache slot
        compute CRF_{n+1}(x) according to Equation 5.1
        ID_{ref,l}(x) ← ID_{ref,c}
end

Function segment(x_i)
begin
    if x_i is or will be stored in cache segment "REC" then
        return "REC"
    else
        return "NON-REC"
end

Procedure select_victim(CS_i)
begin
    min ← ∞
    foreach object version x_i in segment CS_i do
        T(x_i) ← retrieval_latency(x_i)
        PCB_t(x_i) ← CRF(x) · T(x_i)
        if PCB_t(x_i) ≤ min then
            victim ← x_i; min ← PCB_t(x_i)
    evict the determined replacement victim from cache
end

Function retrieval_latency(x_i)
begin
    if segment(x_i) = "REC" then
        calculate T_hit(x_i) according to Equation 5.4
        calculate T_miss(x_i) by using Equation 5.5
    else /* segment(x_i) = "NON-REC" */
        T_hit(x_i) ← 0.0
        compute T_miss(x_i) according to Equation 5.5 with UP(x) set to 1
    T(x_i) ← T_hit(x_i) + T_miss(x_i)
    return T(x_i)
end

Algorithm 5.1: Probabilistic Cost-based Caching (PCC) Algorithm.

5.3.2 PCP: A Probabilistic Cost-based Prefetching Algorithm

While PCC achieves the goal of improving transaction response times by caching demand-requested object versions close to the database application, PCP tries to further reduce fetch latency by proactively loading useful object versions with high access probability and/or high re-acquisition costs into the client cache in anticipation of their future reference. As uncontrolled prefetching without reliable information might not improve, but rather harm, the performance, the greatest challenge of PCP is to decide when and which object version to prefetch and which cache slot to overwrite with the prefetched version when the cache is full. PCP tackles those challenges as follows: in order to behave synergistically with PCC, PCP bases its prefetching decisions on the same performance metric, namely PCB. Since calculating PCB values for every object version that flows past the client is very expensive, if not infeasible, PCP computes those values only for a small subset of the potential prefetching candidates, namely recently referenced objects.

The reason for choosing this heuristic is the assumption that reference sequences exhibit temporal locality [40]. Temporal locality states that once an object has been accessed, there is a high probability that the same object (either the same or a different version) will be accessed again in the near future. To decide whether an object has recently been referenced, clients need to maintain historical information on past object references. As will be explained later, we assume that clients retain such information for the last r distinct object accesses, where r depends on the actual client cache size. Based on this statistical data, PCP selects its prefetch candidates by a simple policy.

In order for a disseminated object version x_i to qualify for prefetching, there must exist a recent entry for object x in the reference history. The exact decision of how recent an object reference has to be in order for the object to qualify for prefetching is left up to the client, since the prefetching decision process is computationally expensive and has to be aligned with the client's resources. If the object qualifies for prefetching, PCP computes x_i's PCB value and compares it with the PCB values of all currently cached object versions. If x_i's PCB value is greater than the least PCB value of all cache-resident object versions, then x_i is prefetched and replaces the least valuable version.

As for the PCC algorithm, prefetch candidates compete for the available cache space only with those versions that belong to the same cache segment.
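The admission test just described can be sketched as follows. This is a simplified illustration under our own naming: the segment is modeled as a flat dict from (oid, version) to PCB value, the reference history as a set of recently referenced object ids, and the PCB computation is passed in as a callback (the thesis implementation uses min-heaps per segment, see Section 5.3.4):

```python
def maybe_prefetch(xi, segment, capacity, history, pcb):
    """Prefetch version x_i only if its object was recently referenced and
    its PCB value beats the least valuable version in the target segment."""
    if xi.oid not in history:                # temporal-locality filter
        return False
    value = pcb(xi)
    if len(segment) < capacity:              # free slot: admit directly
        segment[(xi.oid, xi.version)] = value
        return True
    victim = min(segment, key=segment.get)   # lowest-PCB cached version
    if value > segment[victim]:
        del segment[victim]                  # evict least valuable version
        segment[(xi.oid, xi.version)] = value
        return True
    return False
```

Because prefetch candidates only compete within their own segment, a non-re-cacheable version can never be displaced by a speculatively prefetched re-cacheable one.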

Apart from prefetching current and non-current versions of recently referenced objects, PCP downloads from the broadcast channel all useful versions of data objects that will be discarded from the server by the end of the MBC. The intuition behind this heuristic is to minimize the number of transaction aborts caused by fetch requests that cannot be satisfied by the server. A viable approach to reducing the number of fetch misses is to cache those versions at the client before they are garbage-collected by the server. To implement this approach, mobile clients need information as to whether a particular object version will be disseminated for the last time on the broadcast channel. There are basically two ways in which clients could receive such information: (a) First and most conveniently, the server indicates whether an object version is about to be garbage-collected. That information could be provided, for example, by adding a bit field to the header of each disseminated data page, containing a bit for each object version stored in the data page that indicates whether it will be evicted from the server at the end of the MBC. (b) Alternatively, clients could determine whether an object version becomes non-re-cacheable by keeping track of the object version history and using knowledge of the server's version storage policy. As the latter approach may consume a lot of valuable client storage space, we opt for the first approach. To summarize, the complete pseudo-code of PCP is depicted in Algorithm 5.2 below. To improve the readability of the PCP algorithm, its pseudo-code is modularized into two components: (a) the algorithm's main procedure and (b) the function min_pcb(CS_i), which returns the PCB value of the object version with the lowest PCB value of all the versions currently kept in cache segment CS_i.

Notations:
    CS_i: Client object cache segment i, i ∈ {REC, NON-REC}.
    T_j: Read-only transaction currently run by the client.
    p: A disseminated data or CCR page.

begin
    foreach object version x_i resident in p do
        if x_i is useful for T_j and x_i is not cache-resident and
           (there exists a CRF value for x_i at the client or
            x_i will be garbage-collected by the server at the end of the current MBC) then
            if there is a free cache slot in segment(x_i) then
                insert x_i into segment(x_i)
            else
                T(x_i) ← retrieval_latency(x_i)
                PCB_t(x_i) ← CRF(x) · T(x_i)
                if PCB_t(x_i) > min_pcb(segment(x_i)) then
                    select_victim(segment(x_i))
                    insert x_i into the slot of the recently evicted object version
end

Function min_pcb(CS_i)
begin
    min ← ∞
    foreach object version x_i in CS_i do
        T(x_i) ← retrieval_latency(x_i)
        PCB_t(x_i) ← CRF(x) · T(x_i)
        if PCB_t(x_i) ≤ min then
            min ← PCB_t(x_i)
    return min
end

Algorithm 5.2: Probabilistic Cost-based Prefetching (PCP) Algorithm.

5.3.3 Maintaining Historical Reference Information

It has been noted that MICP takes into account both recency and frequency information on past data references in order to select cache replacement victims. Similar to LRFU, MICP maintains CRF values on a per-object basis that capture information on both recency and frequency of accesses. However, in order for MICP to be effective, such values need to be retained in client memory not only for cache-resident objects, but also for evicted data objects. The necessity to keep historical information on a referenced object even after all versions of this object have been evicted from the cache was first recognized by [115] and was termed the "reference retained information problem". This problem arises from the fact that, in order to gather both recency and frequency information, clients need to keep history information on recently referenced objects for some time. This is in particular required for determining the frequency of object references. If CRF values are maintained only for cached data objects and the size of the client cache is relatively small compared to the database size, then there is a danger that MICP might over-estimate the recency information since frequency information is rarely available. On the other hand, storing reference information consumes valuable memory space that could otherwise be used for storing data objects.

To limit the memory size allocated for historical reference information, O'Neil et al. [115] suggest storing that information only for a limited period of time after the reference has been recorded.

As a reasonable rule of thumb for the length of this period, they use the Five Minute Rule [55]. However, applying it in a mobile environment may be inappropriate for the following reason: a time-based approach to keeping reference information ignores the available cache size and the reference behavior of the client. For example, if a client operates in disconnected mode due to lack of network coverage, its processing may be interrupted because a data request cannot be satisfied by the local cache. In such a situation the client needs to wait until reconnection for transaction processing to continue. Since disconnections might exceed 5 minutes, all the reference information would be lost during such a period. On the other hand, if the client cache size is small, the reference information on objects may have to be discarded even sooner than 5 minutes after their last reference. To resolve the problem of determining a reasonable guideline for maintaining CRF values, we conducted a series of experiments. We found that clients with a cache size in the range of 1 to 10% of the database size should maintain reference information on all recently referenced objects that would fit into a cache about 5 times as large as the actual cache (see Figure 5.6). Clearly, due to its time-independence, such a rule avoids the aforementioned problem of discarding reference information during periods when clients are idle. Moreover, it limits the amount of memory required for storing historical information by coupling the retained-information period to the client cache size.

5.3.4 Implementation and Performance Issues

The previous section has shown that MICP bases its replacement and prefetching decisions on a number of factors combined into the PCB value. However, this metric is dynamic since it changes at every tick of the MBC. Although in theory one could obtain the required values while a page is being transmitted, such an approach would be much too expensive. To reduce overhead, we propose that the PCB estimate of each cached data object be updated either only when a replacement victim is selected or at fixed points in time such as the end of an MIBC. While experimenting with our simulator, we noticed that both approaches substantially reduce processing overhead while providing good performance results. However, we favor the latter technique since it may allow MICP to compute PCB values less frequently. In what follows, we refer to the version of MICP that calculates PCB values periodically as MICP-L, where L stands for "light".

Several statistical parameters are required when calculating PCB values for cache-resident object versions. While most of them can be acquired at the client side, UP and N_ver values are best obtained directly from the server. Although clients could individually determine UP values for database objects, this approach would be too expensive in terms of CPU overhead and power consumption. Instead, we propose that the server centrally calculate UP values and periodically broadcast them. N_ver values, on the other hand, can only be determined with knowledge of the version storage policy of the server. To inform clients of how many versions of an object the server guarantees to maintain, the server assigns a backward version counter to each object version. When a new object version x_i is created at the server, its version counter is initialized to some value v, which equals the number of versions of object x the server is willing to maintain in its memory. Additionally, the counters of existing versions of x are decremented by 1 at both the server and the mobile clients. If the value of a counter reaches zero, the object version is selected for garbage-collection by the server and, if stored in the client cache, is copied from the REC into the NON-REC segment.
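The backward version counter scheme can be sketched as follows (an illustrative transcription with our own names, not the thesis code): every new version of x starts at v, every older version of x is decremented when a newer one is created, and versions reaching zero are reported as expiring, i.e., due for garbage collection at the server and for the REC-to-NON-REC move at the client:

```python
class VersionCounters:
    """Backward version counters for a server keeping v versions per object (sketch)."""

    def __init__(self, v):
        self.v = v          # number of versions the server maintains per object
        self.counters = {}  # (oid, version_no) -> remaining lifetime in versions

    def on_new_version(self, oid, version_no):
        """Register a new version of oid and age all its existing versions.
        Returns the versions whose counters reached zero (now non-re-cacheable)."""
        expired = []
        for key in list(self.counters):
            if key[0] == oid:
                self.counters[key] -= 1
                if self.counters[key] == 0:
                    expired.append(key)
                    del self.counters[key]
        self.counters[(oid, version_no)] = self.v
        return expired
```

With v = 2, creating versions 1, 2, and 3 of object x in turn reports version 1 as expiring when version 3 arrives, which is exactly when the client must treat it as non-re-cacheable.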

In addition to reducing processing overhead by restricting the frequency of calculating PCB values, MICP requires a data structure that efficiently maintains the object versions along with their PCB values. Like many other cache replacement algorithms, MICP can be implemented with two (binary) min-heaps (see Figure 5.2) that order the object versions stored in the NON-REC and REC segments, respectively, by their PCB values. Using min-heaps that keep the object version with the smallest PCB value at the root allows MICP to make cache replacement decisions in O(1) time. Insert and delete operations take at most O(log2 n) time, where n denotes the number of object versions maintained in the respective cache partition. Thus, the time complexity of each cache replacement operation is O(log2 n), which is similar to that of the LFU policy but considerably higher than that of LRU. As noted before, PCB values are re-calculated at fixed time periods. Rebuilding the min-heaps has a time complexity of O(n log2 n).
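A minimal per-segment heap, using Python's standard `heapq` module (the class name and the rebuild interface are our assumptions), could look like this:

```python
import heapq


class SegmentHeap:
    """Min-heap over (PCB value, version key) for one cache segment (sketch)."""

    def __init__(self):
        self.heap = []  # entries: (pcb_value, key); smallest PCB at the root

    def insert(self, key, pcb_value):
        heapq.heappush(self.heap, (pcb_value, key))   # O(log n)

    def victim(self):
        return self.heap[0][1]                        # O(1): lowest-PCB version

    def evict_victim(self):
        return heapq.heappop(self.heap)[1]            # O(log n)

    def rebuild(self, new_pcbs):
        """Periodic PCB re-calculation (MICP-L): re-key all entries, then
        restore the heap property with heapify."""
        self.heap = [(new_pcbs[key], key) for _, key in self.heap]
        heapq.heapify(self.heap)
```

Usage mirrors the replacement path of Algorithm 5.1: `victim()` inspects the candidate in O(1), `evict_victim()` plus `insert()` perform the actual replacement.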

5.4 Performance Evaluation

We studied and compared MICP's performance with other online and offline caching and prefetching algorithms numerically through simulation, and not analytically, because the effects of parameters such as transaction size, client cache size, or total number of versions maintained for each object depend on a number of internal and external system parameters that cannot be precisely estimated by mathematical analysis. The simulator set-up and the generated workloads are based on the same system model that was previously used for evaluating the performance of implementations of various new ILs defined to provide well-defined data consistency and currency guarantees to read-only transactions (see Chapter 4). To gain insight into the efficiency of our proposed client caching and prefetching policy, we extended the simulator with a set of popular caching and prefetching algorithms in addition to MICP and MICP-L. In what follows, we briefly describe the main characteristics of the simulator on which our experimental results are based. The reader who has studied the description of the simulator model given in Chapter 4 may skip the related subsections in this part of the thesis and continue with Section 5.4.3 on page 147.

5.4.1 System Model

The simulation model consists of the following core components: (a) a broadcast server hosting a central database, (b) mobile clients, (c) a broadcast disk, and (d) a hybrid network. These components are briefly described below and are modeled analogously to those of the simulation model presented in Chapter 4.

Both the broadcast server and the mobile clients are modeled as consisting of a number of subcomponents including a processor, volatile cache memory, and magnetic disks, with the latter being only available to the broadcast server, i.e., we assume diskless mobile clients. Data is stored on 4 disks and data accesses are uniformly distributed among the disks by means of a shared FIFO queue. The unit of data transfer between the server and disks is a page of 4 KB and the server keeps a total of 250 pages in its stable memory. The size of an object is 100 bytes and the database consists of a set of 10,000 objects. To reflect the characteristics of a modern disk drive, we experimented with the parameters from the Quantum Atlas 10K III disk [107]. The client CPU speed is set to 100 MIPS and the server CPU speed to 1,200 MIPS, which have been typical processor speeds of mobile and stationary computing devices when this study was conducted two and a half years ago. We have associated CPU instruction costs with various events as listed in Table 5.1. The client cache size is set to 2% of the database size and the server cache size to 20% of the database size. As described in Section 5.2.2, we model the client cache as a hybrid system consisting of both a page-based and an object-based segment. The page-based segment is managed by an LRU replacement policy and the object-based segment by various online and offline cache replacement strategies including MICP and MICP-L. Similarly, the server cache is partitioned into a page cache and a modified object cache (MOB). The page cache is managed using an LRU policy and the MOB is managed in FIFO order. The MOB is initially modeled as a single-version cache. This restriction is later removed to study the effects of maintaining multiple versions of objects in the server cache. Client cache synchronization and freshness are accomplished by inspecting the CCR at the beginning of each MIBC and by downloading newly created object versions whose PCB values are larger than those of currently cached object versions.

The broadcast program has a flat structure. To account for the high degree of skewness in data access patterns [68] and to exploit the advantages of hybrid data delivery, only the latest versions of the most popular 20% of the database objects are broadcast. Note that we assume that clients regularly register at the server to provide their access profiles, so that the server can generate the clients' global access pattern. Every MBC is subdivided into 5 MIBCs, each consisting of a data segment (containing 10 data pages), a (1,m) index [77], and a CCR.

Parameter                                             Value (Sensitivity Range)
-------------------------------------------------------------------------------
Server Database Parameters
    Database size (DBSize)                            10,000 objects
    Object size (OBSize)                              100 bytes
    Page size (PGSize)                                4,096 bytes
Server Cache Parameters
    Server buffer size (SBSize)                       20% of DBSize
    Page buffer memory size                           20% of SBSize
    Object buffer memory size                         80% of SBSize
    Page cache replacement policy                     LRU
    Object cache replacement policy (MOB)             FIFO
    Maximum number of versions maintained
        for each object in the MOB                    1 (1-5) version(s)
Server Disk Parameters
    Fixed disk setup costs                            5,000 instr
    Rotational speed                                  10,000 RPM
    Media transfer rate                               40 Mbps
    Average seek time (read operation)                4.5 ms
    Average rotational latency                        3.0 ms
    Variable network costs                            7 instr/byte
    Page fetch time                                   7.6 ms
    Disk array size                                   4
Client/Server CPU Parameters
    Client CPU speed                                  100 MIPS
    Server CPU speed                                  1,200 MIPS
    Client/server page/object cache lookup costs      300 instr
    Client/server page/object read costs              5,000 instr
    Register/unregister a page/object copy            300 instr
    Register an object in prohibition list            300 instr
    Prohibition list lookup costs                     300 instr
    Inter-transaction think time                      50,000 instr
    Intra-transaction think time                      5,000 instr

Table 5.1: Summary of the system parameter settings – I (Cache replacement and prefetching policies experiments).

Our modeled network infrastructure consists of three communication paths: (a) a unidirectional broadband broadcast channel, (b) shared uplink channels from the client to the server, and (c) shared downlink channels from the server to the client. The network parameters of those communication paths are modeled after a real system such as Hughes Network System's DirecPC¹ [70]. We set the default broadcast bandwidth to 12 Mbps and the point-to-point bandwidth to 400 Kbps downstream and 19.2 Kbps upstream. The point-to-point network is modeled as a shared FIFO queue and each point-to-point channel is dedicated to 5 mobile clients. Charged network costs consist of CPU costs for message processing at the client and server, queuing delay, and transfer time. Processor costs include a fixed and a variable cost component, the latter depending on the message size. With respect to message latency, we experimented with a fixed end-to-end round-trip time of 300 ms. To exclude problems arising when clients operate in disconnected mode (e.g., cache invalidation/synchronization), we assume that clients are always tuned to the broadcast stream and do not suffer from intermittent connectivity. Tables 5.1 and 5.2 summarize the system parameters used in the study.

5.4.2 Workload Model

To produce data contention in our simulator, we periodically modify a subset of the data objects maintained at the server by a workload generator that simulates the effects of read-write transactions being executed at the server. In our system configuration, 20 objects are modified by four fixed-size read-write transactions during the period of an MIBC. Objects read and written by read-write transactions are modeled using a Zipf distribution [168] with parameter θ = 0.80. The ratio of the number of write operations to the number of read operations is fixed at 0.25, i.e., only every fifth operation issued by the server results in an object modification. Read-only transactions are modeled as a sequence of 10 to 50 read operations and they are serialized with the other transactions by the MVCC-SFBS scheme (see Chapter 4). The access probabilities of client read operations follow a Zipf distribution with parameter θ = 0.80 and θ = 0.95. While the θ = 0.95 setting is intended to stress the system by directing about 90% of all object accesses to 10% of the database, the θ = 0.80 setting models a more realistic medium-contention workload (about 75/25). To account for the impact on shared resources (point-to-point communication network, server CPU, and magnetic disks) when clients send fetch requests to the server, we model our hybrid data delivery network in a multi-user environment that services 10 mobile clients. As previously noted, clients do not run more than one read-only transaction at a time and they are only allowed to request object versions from the server if they cannot be retrieved from the air-cache. The latter client behavior is enforced by setting the uplink usage threshold [6] to 100%, which indicates that, independent of the current position of a required object version in the broadcast cycle, it may not be requested from the server.

Parameter                                             Value (Sensitivity Range)
-------------------------------------------------------------------------------
Client Cache Parameters
    Client cache size (CCSize)                        2% (1-5%) of DBSize
    Client object cache size                          80% of CCSize
    Client page cache size                            20% of CCSize
    Page cache replacement policy                     LRU
    Object cache replacement policy                   MICP-L
    Retained information period                       1,000 (200-2,000) references
    Aging factor α                                    0.7
    Replacement policy control parameter λ            0.01
    PCB calculation frequency                         5 times per MIBC
Broadcast Program Parameters
    Number of broadcast disks                         1
    Number of objects disseminated per MBC            20% of DBSize
    Number of index segments per MBC                  5
    Number of CCRs per MBC                            5
    Bucket size                                       4,096 bytes
    Bucket header size                                96 bytes
    Index header size                                 96 bytes
    Index record size                                 12 bytes
    OID size                                          8 bytes
Network Parameters
    Broadcast bandwidth                               12 Mbps
    Downlink bandwidth                                400 Kbps
    Uplink bandwidth                                  19.2 Kbps
    Fixed network costs                               6,000 instr
    Variable network costs                            7 instr/byte
    Propagation and queuing delay                     300 ms
    Number of point-to-point uplink/downlink channels 2

Table 5.2: Summary of the system parameter settings – II (Cache replacement and prefetching policies experiments).

Parameter                                             Value (Sensitivity Range)
-------------------------------------------------------------------------------
Workload Parameters
    Read-write transaction size (number of operations)   25
    Read-only transaction size (number of operations)    25 (10-50)
    Server data update pattern (Zipf distribution, θ)    0.80
    Client data access pattern (Zipf distribution, θ)    [0.80, 0.95]
    Number of updates per MBC                            1.0% of DBSize
    Number of concurrent read-only transactions per client  1
    Uplink usage threshold                               100%
    Abort variance                                       100%

Table 5.3: Summary of the workload parameter settings (Cache replacement and prefetching policies experiments).

¹ DirecPC is now being called DIRECWAY [71].

To control the data access behavior of read-only transactions that were aborted, we use an abort variance of 100%, which means that a restarted transaction reads from a different set of objects.

Table 5.3 summarizes the workload parameters of our simulation study.
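The Zipf-distributed access pattern assumed by the workload model (object i is accessed with probability proportional to (1/i)^θ) can be sketched as follows; the helper names are ours, and this is only one common way to sample such a distribution:

```python
import bisect
import random


def zipf_cdf(n, theta):
    """Cumulative distribution over object ids 1..n with P(i) ~ (1/i)^theta."""
    weights = [(1.0 / i) ** theta for i in range(1, n + 1)]
    total = sum(weights)
    cdf, acc = [], 0.0
    for w in weights:
        acc += w / total
        cdf.append(acc)
    return cdf


def next_access(cdf, rng=random):
    """Draw a 1-based object id according to the precomputed Zipf CDF."""
    return bisect.bisect_left(cdf, rng.random()) + 1
```

Precomputing the CDF once makes each draw an O(log n) binary search, which keeps the per-operation cost of the workload generator negligible in the simulator.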

5.4.3 Other Replacement Policies Studied

In order to show MICP-L’s performance superiority, we need to compare it with state-of-the-art on- line cache replacement and prefetching policies. We experimented with LRFU since it is known to be the best performing page-based cache replacement policy [98, 99]. However, since LRFU does not use any form of prefetching, comparing MICP-L to LRFU directly would be unfair. There- fore, we decided to incorporate prefetching into the LRFU and denote the resulting algorithm as 148 Chapter 5. Client Caching and Prefetching Strategies to Accelerate Read-only Transactions

LRFU-P. In order to treat LRFU-P as fair as possible with respect to data prefetching, we adopt the prefetching heuristic from MICP-L. That is, we select all newly created object versions along with the versions of those objects that have been referenced within the last 1,000 data accesses as prefetch candidates. Out of those candidates, LRFU-P prefetches those versions whose CRF values are larger than the smallest CRF value of all cached object versions. The rest of the algorithm works as described in [98,99].

In addition to comparing MICP-L to LRFU-P, we experimented with the W2R algorithm [83].

We selected the W2R scheme for comparison mainly because it is an integrated caching and prefetching algorithm similar to MICP, and performance results have shown [83] that W2R outperforms caching and/or prefetching policies such as LRU, 2Q [85], and LRU-OBL [144]. However, since W2R was designed for conventional page-based database systems, it has to be adapted to the characteristics of a mobile broadcast-based data delivery environment in order to be competitive with MICP. Our goal was to re-design W2R in such a way that its original design targets and structure are still maintained. In the following we refer to the amended version of W2R as W2R-B, where B stands for broadcast-based. Like W2R, W2R-B partitions the client cache into two segments, called the Weighing Room and the Waiting Room. While the Weighing Room is managed as an LRU queue, the Waiting Room is managed as a FIFO queue. In contrast to W2R, W2R-B admits both referenced and prefetched object versions into the Weighing Room. However, W2R-B grants admission to the Weighing Room only to newly created object versions, i.e., those listed in CCR and whose underlying objects have been referenced within the last 1,000 data accesses. The other modified objects contained in CCR and all the prefetch candidates from the broadcast channel are kept in the Waiting Room. As before, an object version xi becomes a prefetch candidate if some version of object x has been recently referenced. With regard to the segment sizes, we experimentally found out that the following settings work well for W2R-B: the Weighing Room should be 80% of the total available memory size and the remaining 20% should be dedicated to the Waiting Room.
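The two-room organization of W2R-B can be sketched as follows (an illustration only, using the 80/20 split; the class and method names are ours, and the simplified admission test uses a boolean flag in place of the CCR/reference-history lookup):

```python
from collections import OrderedDict, deque

class W2RB:
    """Sketch of the W2R-B cache: an LRU Weighing Room (80% of the
    capacity) and a FIFO Waiting Room (the remaining 20%)."""
    def __init__(self, capacity):
        self.weigh_cap = max(1, int(0.8 * capacity))
        self.wait_cap = max(1, capacity - self.weigh_cap)
        self.weighing = OrderedDict()   # LRU: key -> value
        self.waiting = deque()          # FIFO of (key, value)

    def _admit_weighing(self, key, value):
        if key in self.weighing:
            self.weighing.move_to_end(key)
        self.weighing[key] = value
        if len(self.weighing) > self.weigh_cap:
            self.weighing.popitem(last=False)  # evict the LRU victim

    def reference(self, key, value):
        # Referenced versions go straight into the Weighing Room.
        self._admit_weighing(key, value)

    def prefetch(self, key, value, recently_referenced):
        # Only newly created versions of recently referenced objects are
        # admitted to the Weighing Room; all other prefetch candidates
        # are staged in the FIFO Waiting Room.
        if recently_referenced:
            self._admit_weighing(key, value)
        else:
            self.waiting.append((key, value))
            if len(self.waiting) > self.wait_cap:
                self.waiting.popleft()
```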

Last but not least, we experimented with an offline cache replacement algorithm, called P [4], to establish theoretical bounds on the performance of MICP-L. We have chosen P as the offline policy due to its straightforward implementation, as P determines its replacement victims by selecting the object with the lowest access probability. Since the client access pattern follows a Zipf distribution, the access probability of each object is known at any point in time. Note that under the Zipf distribution,

the probability of accessing the i-th most popular object is p_i = (1/i^θ) / ∑_{j=1}^{N} (1/j^θ), where N is the number of objects in the database and θ is the skewness parameter [31]. Like LRFU, P is a “pure” caching algorithm. Therefore, we had to extend P by incorporating prefetching. To ensure that clients cache all useful versions of objects with the highest likelihood of access, we added an aggressive prefetching strategy to P and called the new algorithm P-P. P-P's relatively simple prefetching strategy is as follows: a newly created or disseminated object version xi is prefetched from the broadcast channel if xi's access probability is higher than the lowest probability of all cached object versions. It should be intuitively clear that such a policy is suboptimal since it neither considers the update and caching behavior of the server nor the serial nature of the broadcast channel.
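The Zipf access probabilities and the P-P prefetch test can be written down directly (a small sketch; the function names are ours, and object ranks are 0-indexed):

```python
def zipf_probs(n, theta):
    """p_i = (1 / i**theta) / sum_{j=1..n} (1 / j**theta); element 0 of
    the returned list is the most popular object."""
    norm = sum(1.0 / j ** theta for j in range(1, n + 1))
    return [(1.0 / i ** theta) / norm for i in range(1, n + 1)]

def pp_prefetch(candidate_rank, cached_ranks, probs):
    """P-P: prefetch a disseminated version iff its access probability
    exceeds the lowest access probability among all cached versions."""
    return probs[candidate_rank] > min(probs[r] for r in cached_ranks)
```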

5.4.4 Basic Experimental Results

In the following subsection we compare the performance of MICP-L to that of the online and offline cache replacement and prefetching policies introduced above under the baseline setting of our simulator. We later vary those parameters in order to observe the changes in the relative performance differences between the policies under different system settings and workload conditions.

We point out that all subsequently presented performance results lie within a 90% confidence level with a relative error of ±5%. We now study the impact of the read-only transaction size on the performance metrics when MICP-L and other policies are used for client cache management. Figures 5.3(a) and 5.3(c) show experimental throughput results of increasing the read-only transaction size from 10 to 50 read operations. A superlinear decrease in the throughput rate is observed when the transaction length is increased. More importantly, and as shown in Figures 5.3(b) and 5.3(d), the performance penalty of using LRFU-P or W2R-B instead of MICP-L as the cache replacement policy is, on average, 19% and 80%, respectively. Further, but not shown in the graphs for reasons of visual clarity, the degradation of the throughput performance caused by computing PCB values periodically (i.e., 5 times per MIBC), rather than every time replacement victims are selected, is insignificant, since MICP outperforms MICP-L by only 3% on average. When comparing the relative performance differences between MICP-L and the other online policies under the 0.80 workload to those under the 0.95 workload, it is interesting to note that the performance advantage of MICP-L declines when the client access pattern becomes less skewed. The reason is the degradation in client cache effectiveness experienced when client accesses are more uniform in nature, together with a weakening in the predictability of future reference patterns from the past reference string. In this situation the impact of the client caching policy on the overall system performance is smaller, and, therefore, the throughput gap between the investigated online policies narrows.

When considering the client cache hit rate, defined as the percentage of object version requests served by the client cache, we also notice MICP-L's superiority compared to the other two investigated online policies (see Figures 5.4(a) and 5.4(b)). On average, MICP-L's cache hit rate is 6% and 94% higher than that of LRFU-P and W2R-B, respectively. At first glance, the relatively large performance gap between MICP-L and LRFU-P might be surprising since both policies select replacement victims (at least partially) based on the objects' CRF values. Thus, one would expect the cache hit rates of both policies to be fairly close to each other. But since MICP-L tries to minimize broadcast retrieval latencies by replacing popular object versions that soon re-appear on the broadcast channel with other less popular versions, which, if not cached, incur high re-acquisition costs when requested, MICP-L's hit rate is expected to be slightly lower than that of LRFU-P. However, as both MICP-L and LRFU-P incorporate prefetching, the performance gain from pre-loading objects into the cache is higher for MICP-L since it keeps more object versions in the client cache that are of potential utility for the active read-only transaction, while LRFU-P, on the contrary, maintains more up-to-date versions potentially useful for future transactions.

5.4.5 Additional Experiments

This subsection discusses the results of some other experiments conducted to determine how MICP-L and its counterparts perform under a varying amount of retained reference information and a varying number of versions maintained for each object by the server. As before, we report the results for both the

[Figure 5.3 spans these pages: four panels plotting throughput per second and the relative performance penalty compared to MICP-L (in percent) against the transaction length (10 to 50) for P-P, MICP-L, LRFU-P, and W2R-B; panels (a) and (b) show the 0.80 workload, panels (c) and (d) the 0.95 workload.]

Figure 5.3: Absolute and relative throughput performance of MICP-L compared to P-P, LRFU-P, and W2R-B under various read-only transaction sizes. Note that the relative performance “penalty” of P-P compared to MICP-L is not specified here, since P-P outperforms MICP-L.

[Figure 5.4: two panels plotting the cache hit rate against the transaction length (10 to 50) for P-P, MICP-L, LRFU-P, and W2R-B; (a) the 0.80 workload, (b) the 0.95 workload.]

Figure 5.4: Client cache hit rate of MICP-L compared to P-P, LRFU-P, and W2R-B under various read-only transaction sizes.

0.80 and the 0.95 workload.

5.4.5.1 Effects of the Version Storage Policy of the Server on the Performance of MICP-L

To study the effect of keeping multiple versions per modified object at the server, we experimented with varying the version storage strategy of the MOB. As noted before, in order to save installation reads the server maintains modified objects in the MOB. In the baseline setting of the simulator, the MOB was organized as a mono-version object cache, i.e., only the most up-to-date versions of recently modified objects are maintained. We now remove that restriction by allowing the server to maintain up to 10 versions of each object. However, as the MOB is organized as a FIFO queue and limited to 20% of the database size, such a high number of versions will only be maintained for a small portion of the frequently updated database objects. As intuitively expected, the system performance increases with a growing number of non-current object versions maintained at the server.

However, it is interesting to note that the gain in throughput performance levels off when the server maintains more than four non-current versions of a recently modified object. Beyond this point, no significant performance improvement can be achieved by further increasing the version retain boundary. As shown in Figure 5.5, the performance gap of MICP-L relative to LRFU-P and W2R-B narrows when the maximum number of versions maintained for each object increases. For example, for the 0.95 workload, the throughput advantage of MICP-L over LRFU-P shrinks from 22% to only 3% as the maximum number of versions maintained by the server increases. The reason is that under a multi-version storage strategy potentially more non-current object version requests from long-running read-only transactions can be satisfied by the server and, thus, fewer read-only transactions have to be aborted. Further, it is worth noticing that LRFU-P performs slightly better than MICP-L if the following two conditions are satisfied: (a) the server does not start overwriting obsolete object versions until at least two versions of each particular object are stored in the MOB, and (b) the client access pattern is not very skewed in nature (80/20 or less). The reason why LRFU-P outperforms MICP-L under such a system setting is the inaccuracy of the PCB values on which MICP-L bases its caching and prefetching decisions.
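The server-side version storage just varied can be sketched as a FIFO modified-object buffer with a per-object version cap (an illustration only; the class layout and names are ours, not the simulator's):

```python
from collections import deque

class MOB:
    """Sketch of the server's modified-object buffer: a FIFO over
    installed versions, capped in total size, keeping at most
    max_versions versions per object."""
    def __init__(self, capacity, max_versions):
        self.capacity = capacity
        self.max_versions = max_versions
        self.fifo = deque()            # (oid, version) in install order
        self.versions = {}             # oid -> versions, oldest first

    def install(self, oid, version):
        self.versions.setdefault(oid, []).append(version)
        self.fifo.append((oid, version))
        # Enforce the per-object version bound first ...
        while len(self.versions[oid]) > self.max_versions:
            victim = self.versions[oid].pop(0)
            self.fifo.remove((oid, victim))
        # ... then the overall FIFO capacity.
        while len(self.fifo) > self.capacity:
            old_oid, old_ver = self.fifo.popleft()
            self.versions[old_oid].remove(old_ver)
```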

[Figure 5.5: two panels plotting throughput per second against the maximum number of versions in the MOB (1 to 10) for P-P, MICP-L, LRFU-P, and W2R-B; (a) client data access pattern θ = 0.80, (b) client data access pattern θ = 0.95.]

Figure 5.5: Performance of MICP-L compared to its competitors with varying number of non-current ver- sions maintained by the server.

5.4.5.2 Effects of the History Size on the Performance of MICP-L

MICP-L requires historical information on the past reference behavior of the client in order to make precise predictions about its future data accesses. In order to collect this information, the objects' reference history needs to be maintained in client memory even after their eviction from the cache. Since keeping superfluous history information in the form of CRF values wastes scarce memory resources, we wanted to determine a rule of thumb for estimating the amount of reference information clients need to retain in order to achieve good throughput performance. To this end, we use the history size/cache size ratio (HCR) defined as

HCR = N_obj / S_mem    (5.7)

where N_obj denotes the total number of objects for which the client retains historical reference information and S_mem represents the client cache size available for storing object versions. As shown in Figure 5.6, we measured MICP-L's performance for various HCR and client cache size combinations. The results show that MICP-L reaches its performance peak if clients maintain reference information on all those recently accessed objects that would fit into the client cache if it were about 5 times larger than its actual size. Beyond that point, i.e., when HCR is larger than 5, MICP-L's throughput slightly degrades. The reason for this degradation is an increase in the number of prefetches caused by MICP-L's prefetching heuristic, which allows clients to download useful object versions into their local caches if the corresponding object has been referenced within the retained information period. As a result of those additional prefetches, object versions useful for the active read-only transaction may be replaced by up-to-date object versions which are of potential use for future transactions. This slightly hurts the cache hit rate, and hence the throughput performance.
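The rule of thumb translates into a bounded history structure (a sketch; the class is ours, and the CRF bookkeeping is reduced to a single value per object):

```python
from collections import OrderedDict

class ReferenceHistory:
    """History of per-object CRF values retained even after cache
    eviction; its capacity follows HCR = N_obj / S_mem (Eq. 5.7),
    with HCR = 5 found experimentally to be the sweet spot."""
    def __init__(self, cache_size, hcr=5):
        self.limit = hcr * cache_size
        self.entries = OrderedDict()          # oid -> last CRF value

    def record(self, oid, crf_value):
        self.entries[oid] = crf_value
        self.entries.move_to_end(oid)         # oid is now most recent
        if len(self.entries) > self.limit:
            self.entries.popitem(last=False)  # forget the oldest entry
```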

[Figure 5.6: two panels plotting throughput per second against HCR (1 to 7) for client cache sizes of 100, 200, 300, 500, and 1000 objects; (a) client data access pattern θ = 0.80, (b) client data access pattern θ = 0.95.]

Figure 5.6: Performance of MICP-L under various cache sizes when HCR is varied.

5.5 Conclusion

We have presented the design and implementation of a new integrated cache replacement and prefetching algorithm called MICP. MICP was developed to efficiently support the data requirements of read-only transactions in mobile hybrid data delivery environments. In contrast to many other cache replacement and prefetching policies, MICP not only relies on future reference probabilities when selecting replacement victims and when prefetching data objects, but additionally uses information about the content and structure of the broadcast channel, the data update probability, and the server storage policy. MICP combines those statistical factors into a single metric, called PCB, in order to provide a common basis for decision making and to achieve the goal of maximizing the transaction throughput of the system. Further, in order to reduce the number of transaction aborts caused by the eviction of useful, but obsolete, object versions from the server, MICP divides the client cache into two variable-size cache partitions and maintains non-re-cacheable object versions in a dedicated part of the cache, called NON-REC. We evaluated the performance of MICP experimentally using simulation configurations and workloads observed in a real system and compared it with the performance of other state-of-the-art online and offline cache replacement and prefetching algorithms. The obtained results show that the implementable approximation of MICP, termed MICP-L, improves the throughput rate, on average, by 19% when compared to LRFU-P, which is the second best performing online algorithm after MICP-L. Further, our experiments revealed that the performance degradation of MICP-L relative to MICP is a modest

3%.

“Theory provides the maps that turn an uncoordinated set of experiments or computer simulations into a cumulative exploration.”

– David Goldberg

Chapter 6

Processing Read-Write Transactions Efficiently and

Correctly

6.1 Introduction

As current technology trends indicate, two issues pose a great challenge to mobile database management in the future, namely limited battery power and restricted wireless bandwidth capacities.

To contribute to a solution of the problem, we take a software approach and propose a suite of five new MVCC protocols designed to provide good performance along with strong semantic guarantees to read-write transactions despite the existence of the aforementioned environmental constraints and limits in mobile and wireless technology. The family of MVCC algorithms that we present provides, on the one hand, a range of useful data consistency and currency guarantees to mobile applications that are assumed to access and update data shared among multiple users and, on the other hand, enables application programmers to trade off data currency for performance. The MVCC-* suite, where the asterisk (*) is a placeholder for the names of the various protocols, takes account of the environmental constraints of mobile computing, as its underlying protocols were built based on

the following design objectives: (a) minimizing the amount of wasted work caused by continuing to process transactions that are doomed to abort, and (b) maximizing the degree of transaction concurrency in the system. Clearly, both factors contribute to the general goal of maximizing overall system performance and thus help to save scarce network bandwidth and battery power.

In contrast to previous approaches which try to support read-only transactions efficiently and correctly [101, 123, 124, 138, 139], our protocols tackle the challenge of providing serializability guarantees to read-write transactions while delivering high QoS levels to mobile users by minimizing transaction response times and reducing the number of false aborts. To provide support for general-purpose database applications where the inherent semantics have not yet been analyzed or are simply unavailable, we build our MVCC schemes by considering only the manner in which data objects are accessed and modified by transactions, i.e., we chose the read/write model for CC purposes.

Another argument in favor of the read-write model is its simplicity and compatibility with the semantics-based CC model. Since all higher-order transactional operations eventually boil down to simple read and write operations, our adopted computational model and its underlying schemes can serve as a solid building block for developing semantics-based CC schemes. As those approaches have the potential to improve transactional performance even further by exploiting the semantic knowledge of applications [11, 25, 141], objects [95, 116, 157], database structure, etc., we consider them a means to complement our protocols rather than an incompatible alternative.

6.1.1 Motivation

The reason for proposing yet another set of CC protocols is the following two observations: (a) Currently available mobile CC protocols are either designed for read-only transactions only [101, 123, 124, 138, 139], or those that efficiently support read-write transactions do not provide serializability guarantees [121]. (b) Those protocols that stick to the serializability criterion either use a mono-version scheduler for CC purposes [102] or do not fully exploit the potential of multi-versioning by keeping only a very limited number of object versions in the system [110].

Furthermore, none of the CC protocols that enforce serializability for read-write transactions explicitly specifies the degree of data currency it provides. Note that there exist a number of conventional MVCC protocols and isolation levels, respectively, that incorporate precise data currency guarantees in their specifications, such as Forward Consistent View [9], Snapshot Isolation [23], and Oracle's Read Consistency [117]. None of them, however, ensures serializability for read-write transactions.

Our protocols address this shortcoming by providing well-defined data currency guarantees and by exploiting the scheduling opportunities offered by a multi-version database system with no a-priori version restriction, while not suffering from its high space overheads. Low storage costs are achieved by continuously providing clients, through the broadcast channel, with the latest CC information on, among other things, the most recent updates to the common database hosted by the broadcast server, and by deploying a judicious garbage collector at the clients that eagerly evicts useless object versions as soon as they are identified by the local transaction manager.

6.1.2 Contribution and Outline

In this chapter, we present a suite of protocols designed for managing read-write transactions in hybrid data delivery environments that differ in terms of performance, degree of data currency, and space and time complexity. In particular, this chapter makes the following contributions: (a) We present five new MVCC protocols that all provide serializability guarantees to read-write transactions, prove their correctness, and show their performance results in comparison with the Snapshot Isolation protocol. (b) We outline the possibilities of extending the protocols in order to reduce the number of false conflicts that occur due to information grouping and to avoid transaction restarts by applying conflict resolution techniques. Additionally, we quantify their effects on the overall system performance through simulation studies. (c) We explain why MICP-L, as introduced in the previous chapter, should be utilized as the client cache replacement and prefetching policy independent of whether read-only or read-write transactions are processed at the clients. This claim is backed up by performance results demonstrating that MICP-L outperforms LRFU-P, an integrated caching and prefetching variant of the LRFU policy [98, 99], the best online cache replacement policy proposed so far.

The rest of this chapter is organized as follows: In Section 6.2, the system architecture, design assumptions, and notational framework are presented. The algorithms underlying the MVCC-* schemes are incrementally introduced in Section 6.3. We start by motivating each protocol, present its basic algorithm, and then gradually refine and extend it in order to obtain a sophisticated and well-performing solution. Performance-related issues are elaborated in Section 6.4, and Section 6.5 presents performance results of our schemes and relates them to other approaches. Additionally, we show the performance potential of MICP-L when used to quickly satisfy the data requests of mobile applications that perform client-side transaction processing and updates on shared data objects.

6.2 System Design and Assumptions

In this section, we briefly present the system model underlying the discussion that follows, give some basic definitions, identify our major assumptions, and state the notational framework of this chapter. For reasons of efficiency and conformity with our previously presented research studies, we develop our MVCC-* suite again in the context of a hybrid data delivery environment. Thus, readers familiar with the basic concepts of hybrid data delivery and our assumptions on the content and organization of the broadcast program may skip the following subsection, if they wish, or treat it as an opportunity to refresh their memory.

6.2.1 Data Delivery Model

Although the basic ideas intrinsic to the discussed protocols are not restricted to mobile computing, the envisaged area of their use is a hybrid data delivery environment. Remember that we define a hybrid data delivery network as a communications infrastructure that, on the one hand, allows resource-poor clients to establish point-to-point connections with a resource-rich broadcast server and, on the other hand, allows the server to broadcast information to all clients tuned into the broadcast channel. The reason for choosing a hybrid network as the building block for our MVCC-* suite is fourfold: (a) Combining the pull/unicasting and push/broadcasting modes into a hybrid one helps overcome the limitations of each individual method, resulting in a performance improvement over either basic communication technique. (b) Because hybrid communication networks exist in both wired and wireless infrastructures, the proposed protocols are applicable to wireless and stationary networks. (c) The prevalent bandwidth asymmetry often found in cellular or satellite networks ideally fits the bandwidth requirements of the CC and cache coherency protocols suitable for mobile environments. Typically, database servers deployed in such environments are stateless in nature, i.e., they do not keep any state information about their serviced clients due to scalability concerns. As a consequence, the server has neither information about the state of the transactions processed by clients nor about the contents of their caches. Therefore, the server is forced to transfer, either instantaneously or periodically, newly accrued concurrency control information along with a copy of each newly created object version to the client population.
Obviously, those tasks are most efficiently carried out via the broadcast channel of a hybrid network, whereas data requests for unpopular data objects and transaction commit messages should be handled by means of point-to-point communication. (d) Last, but not least, as previously presented research work (see Chapters 4 and 5) has been conducted within the context of a hybrid network, it is highly desirable to use the same network model in this work for comparison purposes as well.

As partially indicated above, the role of the broadcast medium in a transaction-oriented client- server environment is twofold: (a) to efficiently transmit popular data objects to the client com- munity, and (b) to provide clients or, more precisely, their transaction and cache managers with concurrency and cache coherency information. The data chosen for data dissemination is summa- rized within a broadcast program which has the following logical structure: (a) one or multiple index segments to make the data self-descriptive, (b) one or multiple data segments that contain popular database objects, and (c) one or multiple CCR segments which include information de- sirable for transaction validation and cache synchronization. We assume a simplified broadcast program structured into a number of logical units called MIBC. Each MIBC commences with a

CCR, followed by a (1,m) index [78] which indicates the position of each disseminated object in the broadcast program, and concludes with a sequence of popular data objects belonging to some particular data segment. Since broadcast programs are typically large in size and clients should regularly be provided with CC and cache coherency information, we assume that only a subset of the data objects scheduled for broadcasting is disseminated within each MIBC. Hence, an MBC consists of a number of MIBCs and is completed after each data object scheduled for broadcasting has been disseminated once. Figure 6.1 once more depicts the structure of the presumed broadcast program for the reader's convenience.

Major Broadcast Cycle Minor BC1 Minor BC2 Minor BC3 Minor BC4 1 1 2 2 3 3 4 4

Data Segment1 Data Segment2 Data Segment1 Data Segment4 CCR CCR CCR CCR Index Index Index Index Segment Segment Segment Segment Segment Segment Segment Segment

CCR ID TID ...... FirstCCRListEntry ReadSet ... WriteSet ... ST ... FirstWriteSetValueListEntry ... OID NextListEntry ... ObjectValue NextListEntry ...

Figure 6.1: Structure of the broadcast program.

As the diagram illustrates, an entry in the CCR segment contains the following CC related information of each recently, i.e., in the previous MIBC, committed read-write transaction Ti: (a) TID, (b) ReadSet, (c) WriteSet, (d) WriteSetValues, (e) ST where TID denotes the globally unique transaction identifier of Ti, ReadSet and WriteSet represents the identifiers of the objects observed and written, respectively, by Ti, WriteSetValues denotes a list of pairs that maps to each newly created object version xi by a read-write transaction Ti its associated value v, and ST denotes a set of transaction identifiers that are serialized after Ti. Note that we do not associate commit timestamps to read-write transactions contained in a CCR since transactions committed during the same MIBC all carry the same commit timestamp being equal to the number of the MIBC when they committed.

This simplification helps us to easy the presentation of our protocols. Further note that CCRs are not primarily required for protocol correctness, but merely for protocol efficiency since they provide the following benefits to mobile users: (a) CCRs help identify useless object versions, thus ensuring that the effective client cache size is not lowered by unnecessarily maintaining those versions; (b) 6.2. System Design and Assumptions 163

CCRs provide scalability since a major part of transaction processing and validation work can be offloaded from the server to the clients; (c) CCRs reduce wasted work since they provide CC information to clients allowing them to recognize those transactions doomed to abort; (d) CCRs keep client caches up-to-date, eliminating the risk of aborts due to stale object observations.

6.2.2 Database and Transaction Model

In our system model the database D = {x1, x2, ..., xi} consists of a set of uniquely identified objects, where i denotes the number of data objects in the database and each object has one or more versions that are totally ordered with respect to one another according to the commit times of the transactions that created the versions. The state of the database is modified by read-write transactions initiated by mobile applications. A transaction is a partial order of operations on objects of the database (see Definition 9 for its formal definition). To keep our notation consistent with that used in previous chapters, a read operation of a transaction Ti on object x is denoted by ri[xk], where the subscript k is the identifier of the transaction that installed the version xk, i.e., object versions are denoted by the index of the transaction that created the respective version. Write operations of a transaction Ti are denoted by wi[xi], and bi, ai, and ci specify the begin, abort, and commit operation of transaction Ti, respectively. The read set of transaction Ti is the set of data objects that Ti reads and is denoted by ReadSet(Ti). Likewise, the write set of transaction Ti is the set of data objects that Ti writes and is denoted by WriteSet(Ti). To argue about the timing relationships among transactions, the transaction scheduler records, for any read-write transaction Ti, its start timestamp, denoted by STS(Ti), and its commit timestamp, denoted by CTS(Ti). We assume that read-write transactions do not issue blind writes, i.e., if a transaction writes some object version, it is assumed to have read some previous version of the object before. We make the latter assumption for two reasons: (a) blind writes are infrequent in applications, and (b) accommodating them in our model would complicate the algorithms and correctness proofs of our protocols significantly.
We also assume that the same data object is not modified more than once during a transaction’s lifetime and that any mobile client runs at most one read-write transaction at a time.

164 Chapter 6. Processing Read-Write Transactions Efficiently and Correctly
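The transaction model above can be made concrete with a small sketch. This is an illustrative data-structure layout, not part of the thesis; the class and field names (`Version`, `Transaction`, `read_set`, etc.) are assumptions. The two `assert` statements encode the model's restrictions: no blind writes and at most one write per object per transaction.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

@dataclass(frozen=True)
class Version:
    obj: str       # object identifier, e.g. "x"
    creator: int   # id of the transaction T_k that installed x_k
    cts: int       # CTS(T_k), commit timestamp of the creator

@dataclass
class Transaction:
    tid: int
    sts: int = 0                                        # STS(T_i)
    cts: Optional[int] = None                           # CTS(T_i), set at commit
    read_set: Set[str] = field(default_factory=set)     # ReadSet(T_i)
    write_set: Set[str] = field(default_factory=set)    # WriteSet(T_i)

    def read(self, v: Version) -> None:
        self.read_set.add(v.obj)

    def write(self, obj: str) -> None:
        # no blind writes: a previous version must have been read first
        assert obj in self.read_set, "blind writes are disallowed"
        # the same object is modified at most once per transaction
        assert obj not in self.write_set, "object already written"
        self.write_set.add(obj)
```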

Central to any CC scheme is the notion of conflicts. Since we have opted for the read-write model to perform concurrency control, three kinds of direct conflicts between any pair of trans- actions may occur, namely write-read, write-write, and read-write dependencies. The definitions for those conflicts which are essential for the descriptions and proofs of our protocols have already been given in Section 4.2 of this thesis and are once more summarized in Table 6.1.

Conflict / Dependency | Definition
direct write-read dependency (denoted δ^wr or wr) | A transaction Tj wr-depends on a transaction Ti, if Ti installs some object version xi and Tj reads the created version, i.e., wi[xi] <MVH rj[xi].
direct write-write dependency (denoted δ^ww or ww) | A transaction Tj ww-depends on a transaction Ti, if Ti installs some object version xi and Tj installs x’s next version after xi.
direct read-write dependency (denoted δ^rw or rw) | A transaction Tj rw-depends on a transaction Ti, if Ti reads some object version xk and Tj installs x’s next version after xk.

Table 6.1: Definitions of possible conflicts between transactions.
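The three direct dependencies of Table 6.1 can be detected mechanically once each committed transaction's read and write sets and the per-object version order ≪ are known. The sketch below is illustrative; the dictionary layout (`reads` mapping objects to creator ids, `version_order` mapping objects to creator ids in ≪ order) is an assumption, and "direct" successorship is approximated by adjacency in the version order.

```python
def dependencies(ti, tj, version_order):
    """Return the set of direct dependencies of Tj on Ti.

    ti, tj: dicts with 'tid', 'reads' ({object: creator_tid}) and
    'writes' (set of objects); version_order: {object: [tid, ...]} in << order.
    """
    deps = set()
    # wr: Tj read a version that Ti installed
    if ti['tid'] in tj['reads'].values():
        deps.add('wr')
    for obj in tj['writes']:
        order = version_order[obj]          # creator tids in << order
        j = order.index(tj['tid'])
        # ww: Tj installed the direct successor of Ti's version of obj
        if obj in ti['writes'] and j == order.index(ti['tid']) + 1:
            deps.add('ww')
        # rw: Ti read the version whose direct successor Tj installed
        k = ti['reads'].get(obj)
        if k is not None and j == order.index(k) + 1:
            deps.add('rw')
    return deps
```

For instance, with the version order of Example 6 below (x0 ≪ x1), a transaction that read x0 is rw-depended upon by the transaction that installed x1.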

We have selected a conflict detection-based or optimistic approach [96] for concurrency control since it frees clients from performing lock acquisitions and exempts the server from carrying out deadlock detection. Another argument in favor of optimistic concurrency control (OCC) is the fact that during periods of disconnection transaction processing will not be interrupted, as is the case when pessimistic schemes are applied. However, OCC protocols force transactions to be validated or certified at their commit points in order to preserve serializability. Optionally, transactions may be pre-validated at earlier stages in order to identify transactions that are destined to abort. Independent of the point in time at which an active read-write transaction Ti is validated, the notion of conflict-based serializability requires that Ti must be certified against the set of already committed transactions Tactive(Ti) = {T1,T2,...,Tn} that were active during Ti’s execution time, i.e.,

∀Tj ∈ Tactive(Ti) : bi <MVH cj <MVH ci.

6.3 A New Suite of MVCC Protocols

In this section, we describe three new MVCC protocols, namely MVCC-BOT, MVCC-IBOT, and MVCC-EOT, followed by optimizations of the first two schemes and their correctness proofs. Presentation of the MVCC-BOT and MVCC-IBOT protocols will start out by outlining a basic, not yet very efficient design, which will then be gradually refined and extended until a sophisticated solution is obtained. We opt for such an approach since it allows us to provide correctness arguments in an incremental and concise manner.

6.3.1 MVCC-BOT Scheme

Performance studies conducted in [124] and by ourselves (see Chapter 4) have shown that MVCC protocols that provide begin-of-transaction (BOT) data currency guarantees to read-only transactions, such as the multi-version caching method or the MVCC-BS protocol, outperform those with EOT data currency guarantees, such as the invalidation-only method [124]. The potential of improving the overall system performance by providing BOT data currency to read-write transactions was first recognized by [23], and the property has been incorporated into the Snapshot Isolation (SI) protocol. Contrary to many previously proposed CC protocols, SI is implemented in commercial and non-commercial products such as Oracle 10g (referred to as the Serializable isolation level in Oracle [82, 117]), SQL Server 2005 [152], and PostgreSQL [57]. SI’s popularity in practice arises from its two fundamental properties: (a) Non-current object reads do not cause transaction restarts, and (b) the validation of an active read-write transaction Ti is restricted to an intersection test of Ti’s write set with the write sets of Tactive(Ti); as read-write transactions typically perform much fewer write than read operations, the probability that Ti can be successfully validated is relatively high. Despite SI’s performance attractiveness, it is not an option for mission-critical applications since it may leave the database in a corrupted state due to its inability to prevent the “Write Skew” phenomenon [23]. A protocol that rectifies SI’s consistency problems without losing all of its benefits is called MVCC-BOT. Contrary to SI, MVCC-BOT guarantees serializability to read-write transactions while still ensuring SI’s BOT data currency. Providing BOT data currency guarantees to mobile applications is attractive in at least two respects: (a) Transactions can be provided with useful data currency guarantees that cannot be ensured by serializability alone. For example, if a meteorologist needs temperatures measured by all weather stations around New York as of the current moment, he/she could run a query requesting a respective snapshot, and the system would return the values valid as of the transaction’s starting point, i.e., the meteorologist would not observe any temperature changes after the transaction initiated its processing, which actually might be undesirable for statistical reasons. (b) Transactions are allowed to observe stale object versions, which may result in fewer transaction restarts compared to conventional mono-version protocols (which oblige transactions always to observe the most up-to-date object version), as the following example shows:

Example 6.

MVH4 = b1 b2 b3 r1[x0] r2[x0] r2[y0] r3[y0] r3[z0] w1[x1] r3[x1] w3[z3] w2[y2] c1 c2 a3
[x0 ≪ x1, y0 ≪ y2, z0 ≪ z3]

MVH5 = b1 b2 b3 r1[x0] r2[x0] r2[y0] r3[y0] r3[z0] w1[x1] r3[x0] w3[z3] w2[y2] c1 c2 c3
[x0 ≪ x1, y0 ≪ y2, z0 ≪ z3]

History MVH4 differs from history MVH5 in that the former was produced by a conventional mono-version scheduler, while the latter was generated by a multi-version scheduler enforcing BOT data currency. In history MVH4 transaction T3 was aborted since otherwise the multi-version serialization graph of MVH4 would no longer have been acyclic (see Figure 6.2(a)). Conversely, in history MVH5 all three transactions T1, T2, and T3 terminate successfully because they can be serialized into the order T3 < T2 < T1 (see Figure 6.2(b)).

[Figure: two multi-version serialization graphs over the nodes T0, T1, T2, and T3 with edges labeled wr, ww, and rw]

(a) MVSG(MVH4)   (b) MVSG(MVH5)

Figure 6.2: Multi-version serialization graph of MVH4 and MVH5.
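Acyclicity of a serialization graph such as MVSG(MVH5) can be checked with a standard depth-first search. The sketch below is illustrative; the edge lists transcribe the dependencies of Example 6 (in MVH4, T3's read of x1 closes the cycle T1 → T3 → T2 → T1; in MVH5, T3 reads x0 instead, so the graph stays acyclic and T3 < T2 < T1 is a valid serialization order).

```python
def has_cycle(edges):
    """DFS cycle test on a directed graph given as {node: [successor, ...]}."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in edges}

    def visit(n):
        color[n] = GRAY
        for m in edges.get(n, ()):
            c = color.get(m, WHITE)
            if c == GRAY:                 # back edge -> cycle found
                return True
            if c == WHITE and visit(m):
                return True
        color[n] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in list(edges))

# Dependency edges derived from Example 6 (reader/overwriter -> dependent txn)
mvsg4 = {'T0': ['T1', 'T2', 'T3'], 'T1': ['T3'], 'T3': ['T2'], 'T2': ['T1']}
mvsg5 = {'T0': ['T1', 'T2', 'T3'], 'T3': ['T1', 'T2'], 'T2': ['T1'], 'T1': []}
```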

Now that we have seen that MVCC-BOT incorporates attractive properties that are of much practical use to mobile application and database users, we delve into the protocol’s details. To avoid wasted work by processing transactions that are destined to abort, MVCC-BOT validates any read-write transaction Ti not only at its commit point, but also during its execution time, namely whenever it issues a new data operation or its processing client receives a new CCR. To do so, MVCC-BOT (like the other schemes of the MVCC-* suite) uses the backward validation technique [62]. Since MVCC-BOT ensures BOT data currency guarantees, the validation algorithm is straightforward to implement. To provide serializability, a validating read-write transaction Ti needs to be checked (only) against all those transactions that are in Tactive(Ti), i.e., those transactions that committed during Ti’s execution time. MVCC-BOT successfully validates an active read-write transaction Ti against a read-write transaction Tj ∈ Tactive(Ti) if the following condition is satisfied:

Serializability Condition 1. Ti must not (either directly or indirectly) rw-depend on Tj, i.e., Ti must not overwrite an object version (either directly or indirectly) read by Tj.

In order to ensure BOT data currency guarantees and to detect violations of the serializability criterion instantly, the MVCC-BOT scheduler operates as follows:

1. Start Rule: As soon as Ti issues its first read operation, the identifier of the current MIBC is stored into STS(Ti).

2. Read Rule: A read operation ri[x] is processed as follows:

(a) A read operation ri[x] is transformed into ri[xk], where xk is the latest committed version of x that was created by a read-write transaction Tk such that CTS(Tk) < STS(Ti).

(b) If there exists a write operation wj[xj] in MVH such that Tj ∈ Tactive(Ti) and Tj rw-depends on Ti, i.e., it has overwritten the object version xk observed by Ti (xk ≪ xj), then the read operation is rejected and Ti is aborted.

(c) To record the fact that Tk precedes Ti in any valid one-copy serializable history, the scheduler inserts Tk into PT(Ti).

3. Write Rule: A write operation wi[x] is processed as follows:

(a) If there exists a read operation rj[xk] in MVH and

i. Tj ∈ ST(Ti), or
ii. Tj (either directly or indirectly) wr- or rw-depends on some read-write transaction Tk ∈ ST(Ti),

then wi[x] is rejected and Ti is aborted.

(b) Otherwise, wi[x] is transformed into wi[xi] and executed.

Algorithm 6.1: MVCC-BOT’s scheduling algorithm.
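Point (2a) of the scheduling algorithm, the BOT version-selection step, can be sketched as follows. This is a minimal illustration under an assumed version-store layout (a per-object list of versions sorted by their creators' commit timestamps); it is not the thesis' implementation.

```python
def select_bot_version(versions, sts_ti):
    """Point (2a): return the latest committed version of an object whose
    creating transaction committed before STS(T_i).

    versions: list of (creator_cts, value) pairs in << order (ascending CTS);
    sts_ti: STS(T_i), the identifier of the MIBC in which T_i started.
    """
    visible = [v for v in versions if v[0] < sts_ti]
    if not visible:
        raise LookupError("no version committed before STS(T_i)")
    return visible[-1]

# x0 committed at time 1, x1 at time 5; a transaction with STS = 3 reads x0
assert select_bot_version([(1, "x0"), (5, "x1")], 3) == (1, "x0")
```

A transaction that starts after x1's creator committed (STS = 7, say) would read x1 instead, which is exactly the BOT snapshot behavior of the read rule.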

CCRs enable clients (among other things) to pre-validate active read-write transactions without contacting the server. MVCC-BOT uses Algorithm 6.2 to validate an active read-write transaction Ti and to update its associated data structure (ST(Ti)) as shown below:

1 begin
2   foreach Tj in CCR do
3     if Serializability Condition 1 holds then
4       if Tj rw-depends on Ti then
5         insert Tj along with ST(Tj) into ST(Ti);
6         if ST(Ti) ∩ PT(Ti) ≠ ∅ then
7           abort Ti
8     else
9       abort Ti
10 end

10 end

Algorithm 6.2: CCR processing and transaction validation under MVCC-BOT.
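Algorithm 6.2 can be transcribed almost line by line into Python. The sketch below assumes the two predicates `condition1` (Serializability Condition 1) and `rw_depends` are supplied by the scheduler; the dictionary layout is illustrative.

```python
def process_ccr(ti, ccr, condition1, rw_depends):
    """Pre-validate the active transaction ti against a CCR (Algorithm 6.2).

    ti: dict with 'PT' and 'ST' sets of transaction ids.
    ccr: iterable of committed transactions, each a dict with 'tid' and 'ST'.
    condition1(ti, tj): True iff Serializability Condition 1 holds.
    rw_depends(tj, ti): True iff tj rw-depends on ti.
    Returns True if ti survives pre-validation, False if it must abort.
    """
    for tj in ccr:
        if not condition1(ti, tj):
            return False                    # line 9: abort ti
        if rw_depends(tj, ti):
            ti['ST'].add(tj['tid'])         # line 5: insert tj ...
            ti['ST'] |= tj['ST']            # ... along with ST(tj)
            if ti['ST'] & ti['PT']:         # line 6: ST(ti) ∩ PT(ti) ≠ ∅
                return False                # line 7: abort ti
    return True
```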

In case Ti fails the validation check, it will be aborted and subsequently restarted. Otherwise, the execution of Ti proceeds. Once Ti issues a commit primitive, i.e., all of its data operations were successfully executed by the client, it automatically pre-commits without any further pre-validation, and a final validation message, denoted FVM(Ti), is sent to the server. An FVM(Ti) is a 5-tuple (ReadSet(Ti), WriteSet(Ti), PT(Ti), ST(Ti), TSC(CCRlast)), where TSC(CCRlast) denotes the timestamp of the last CCR that has been successfully processed by client C; in case the client was disconnected for some time during Ti’s execution and, therefore, might have missed one or more CCRs, C sends the timestamp of the largest CCR up to which it had received a complete sequence of CCRs. Note that the client does not need to attach Ti’s start timestamp to FVM(Ti) as such information is only required by the local scheduler to transform object read operations into consistent object version reads.
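The 5-tuple FVM(Ti) maps naturally onto a named tuple. The field and function names below are illustrative assumptions, not the thesis' wire format:

```python
from collections import namedtuple

# FVM(Ti) = (ReadSet(Ti), WriteSet(Ti), PT(Ti), ST(Ti), TS_C(CCR_last))
FVM = namedtuple("FVM", ["read_set", "write_set", "pt", "st", "ts_ccr_last"])

def build_fvm(ti, last_complete_ccr_ts):
    """Assemble the final validation message for a pre-committed Ti.

    last_complete_ccr_ts: timestamp of the latest CCR up to which the client
    received a gap-free CCR sequence (this covers disconnection periods).
    STS(Ti) is deliberately omitted: it is needed only by the local scheduler
    to map reads onto consistent object versions.
    """
    return FVM(frozenset(ti['reads']), frozenset(ti['writes']),
               frozenset(ti['PT']), frozenset(ti['ST']), last_complete_ccr_ts)
```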

When the server receives the validation information of a committing transaction Ti, it runs Algorithm 6.2 for all read-write transactions Tj where CTS(Tj) > TSC(CCRlast). If Ti’s final validation succeeds, Ti’s updates are applied to the central database hosted by the broadcast server and a respective notification is sent to the client through the point-to-point channel. Otherwise, Ti aborts and its restart is implicitly initiated through the abort message. To conclude, we show that MVCC-BOT produces only serializable histories that provide BOT data currency guarantees to read-write transactions.

Theorem 8. Every history generated by MVCC-BOT is serializable and any read-write transaction Ti sees a database state as it existed at the beginning of the MIBC when Ti started executing.

Proof. The proof consists of two parts. In the first part we show that MVCC-BOT ensures BOT data currency. Thereafter, we prove by contradiction that MVCC-BOT produces only serializable histories. We do so by first considering the special case that MVSG(MVH) contains a cycle involving only two transactions and then turn to the more general case that the cycle is formed by three or even more transactions.

Part A: Let Ti denote a read-write transaction, STS(Ti) the logical time when Ti started its execution, and DSSTS(Ti) the database state as it existed at Ti’s starting point. We will show that all data objects read by Ti belong to DSSTS(Ti) and, therefore, that MVCC-BOT ensures BOT data currency guarantees. MVCC-BOT’s read rule (i.e., Point (2a)) enforces that Ti always reads the latest committed object versions that were created by read-write transactions with commit timestamps < STS(Ti). Those object versions are, according to MVCC-BOT’s start rule, the most up-to-date ones as of the beginning of the MIBC when Ti started its execution, i.e., they belong to DSSTS(Ti).

Part B: Let MVH denote any multi-version history produced by the MVCC-BOT scheme and MVSG(MVH) its multi-version serialization graph. To show that MVH is serializable, it suffices to show that MVSG(MVH) is acyclic. Suppose, by way of contradiction, that MVSG(MVH) contains a cycle ⟨Ti → Tj0 → ... → Tjn → Ti⟩, where Ti and Tjn with n ≥ 0 denote read-write transactions that have been processed under the MVCC-BOT scheme.

(1) Cycle consists of exactly two transactions: To start with, let us assume that the cycle consists of two transactions only, namely Ti and Tj0, i.e., n = 0. Then, one of the following dependency relationships between Ti and Tj0 must hold: (a) Ti and Tj0 wr-depend on each other, (b) Ti and Tj0 rw-depend on each other, or (c) Ti wr-depends on Tj0 and Tj0 rw-depends on Ti or vice versa.

(a) Suppose that Ti wr-depends on Tj0. As MVCC-BOT ensures BOT data currency guarantees, it follows that the ordering relation cj0 <MVH bi holds. Since Tj0 also wr-depends on Ti, it likewise follows that the ordering relation ci <MVH bj0 holds. This, however, contradicts Definition 9, by which a transaction may only commit after it has been initiated. Thus, MVSG(MVH) cannot contain the cycle ⟨Ti δ^wr Tj0 δ^wr Ti⟩ when the MVCC-BOT scheme is used.

(b) Suppose that Ti rw-depends on Tj0, that ci precedes cj0 in MVH, and that Tj0 does not rw-depend on Ti by the time the former gets to know about Ti’s commit. Then, it follows that Ti is an element of ST(Tj0) according to Algorithm 6.2. Now suppose that at a later stage Tj0 overwrites some object version observed by Ti, which implies that Tj0 now rw-depends on Ti. Then, however, MVCC-BOT’s write rule (i.e., Point (3(a)i)) would be violated if this write operation were not rejected. Suppose otherwise that Tj0 had already rw-depended on Ti by the time Tj0 was validated against Ti. Then, Serializability Condition 1 of Algorithm 6.2 would be violated if Tj0 were not aborted. Alternatively, suppose that Ti rw-depends on Tj0 and that cj0 precedes ci in MVH. Then, Serializability Condition 1 of Algorithm 6.2 would be violated if Ti were not aborted. Hence, the cycle ⟨Ti δ^rw Tj0 δ^rw Ti⟩ cannot be produced under the MVCC-BOT scheme.

(c) Suppose finally that Ti wr-depends on Tj0. Then, it follows that cj0 precedes bi. Now, however, Tj0 may not rw-depend on Ti since otherwise MVCC-BOT’s BOT data currency property would be violated. Using the same line of reasoning, the dependencies “Ti rw-depends on Tj0” and “Tj0 wr-depends on Ti” cannot co-exist without the MVCC-BOT scheme being violated. Thus, the cycles ⟨Ti δ^rw Tj0 δ^wr Ti⟩ and ⟨Ti δ^wr Tj0 δ^rw Ti⟩ cannot occur in MVSG(MVH) and, therefore, it cannot contain a cycle involving exactly two transactions.

(2) Cycle consists of three or more transactions: In the more complex case, the cycle may involve three or more read-write transactions. Irrespective of how many transactions form the cycle, it must have an edge Ti → Tj0, where Tj0 is a read-write transaction that either wrote the successor version of an object read by Ti, i.e., Tj0 rw-depends on Ti, or observed an object version written by Ti, i.e., Tj0 wr-depends on Ti.

(a) Now let us initially assume that Tj0 wr-depends on Ti, which implies that ci <MVH bj0 holds and that Ti belongs to PT(Tj0). Suppose also that Tjn (either directly or indirectly) wr- and/or rw-depends on Tj0.

I) If Tjn wr-depends on Tj0, the ordering relation cj0 <MVH bjn holds and, by transitivity, so does ci <MVH bjn. Now suppose that Ti rw- or wr-depends on Tjn, which implies that the ordering relations bjn <MVH ci or cjn <MVH bi hold, respectively. Together with ci <MVH bjn, this contradicts Definition 9, by which a transaction may only commit after it has been initiated and may commit only once during its lifetime. Thus, the cycles ⟨Ti δ^wr Tj0 δ^wr ... δ^wr Tjn δ^wr Ti⟩ and ⟨Ti δ^wr Tj0 δ^wr ... δ^wr Tjn δ^rw Ti⟩ cannot be formed in MVSG(MVH).

II) However, the cycle may be produced if Tjn rw-depends on Tj0, which implies that bj0 precedes cjn and, thus, ci <MVH cjn. Since ci precedes cjn in MVH, Ti cannot wr-depend on Tjn and, therefore, may only rw-depend on it. Since the ordering relations bjn <MVH ci and bj0 <MVH cjn hold, Tj0 and Tjn are executed concurrently and can commit in arbitrary order.

• Now suppose that Tj0 commits before Tjn and that Tjn does not rw-depend on Tj0 by the time the former gets to know about Tj0’s commit. Now suppose that at a later point in time Tjn overwrites some object version observed by Tj0, which implies that Tjn now rw-depends on Tj0. Then, however, MVCC-BOT’s write rule (i.e., Point (3(a)ii)) would be violated if this write operation were not rejected. Suppose otherwise that Tjn had already rw-depended on Tj0 by the time Tjn was validated against Tj0. Then, however, Serializability Condition 1 of Algorithm 6.2 would be violated if Tjn were not aborted. Thus, the cycle ⟨Ti δ^wr Tj0 δ^rw ... δ^rw Tjn δ^rw Ti⟩ cannot be created in MVSG(MVH) when Tj0 commits before Tjn.

• Suppose, on the contrary, that Tj0 commits after Tjn and that Tjn does not rw-depend on Tj0 by the time the latter gets informed about Tjn’s commit. Suppose further that at some later point in time Tj0 observes some object version whose successor version was installed by Tjn, and thus Tjn rw-depends on Tj0. Then, however, MVCC-BOT’s read rule (i.e., Point (2b)) would be violated if the read operation were not rejected. Suppose, alternatively, that Tjn had already rw-depended on Tj0 when Tj0 was validated against Tjn. Then, the condition ST(Tj0) ∩ PT(Tj0) ≠ ∅ tested at line 6 of Algorithm 6.2 would be violated if Tj0 were not aborted. The reason is as follows: Since Ti commits before Tjn and Ti rw-depends on Tjn, Ti is an element of ST(Tjn). Since Tjn rw-depends on Tj0, Tjn and Ti are included in ST(Tj0) according to Algorithm 6.2. Because Tj0 wr-depends on Ti, it follows that Ti is a member of PT(Tj0) according to MVCC-BOT’s read rule. Obviously, the condition ST(Tj0) ∩ PT(Tj0) = ∅ is no longer satisfied and, therefore, Tj0 would be aborted by MVCC-BOT. Consequently, the cycle ⟨Ti δ^wr Tj0 δ^rw ... δ^rw Tjn δ^rw Ti⟩ cannot be created when Tj0 commits after Tjn.

(b) What remains to be shown is that the cycle cannot be created when Tj0 rw-depends on Ti. If such a dependency exists, bi precedes cj0 in MVH.

I) Now suppose that Tjn (either directly or indirectly) wr-depends on Tj0, which implies that cj0 occurs before bjn; as the relation bi <MVH cj0 holds, it follows that bi <MVH bjn.

i) Now suppose that Ti wr-depends on Tjn, which implies that cjn occurs before bi. Since bi <MVH bjn <MVH cjn holds, this contradicts cjn <MVH bi, so the cycle ⟨Ti δ^rw Tj0 δ^wr ... δ^wr Tjn δ^wr Ti⟩ cannot occur in MVSG(MVH).

ii) Therefore, Ti may only rw-depend on Tjn, which implies that bjn precedes ci. Since the ordering relations bi <MVH bjn and bjn <MVH ci hold, Ti and Tjn are executed concurrently and can commit in arbitrary order.

• Initially suppose that Tjn commits before Ti and that Ti does not rw-depend on Tjn by the time the former gets to know about Tjn’s commit. Suppose further that at some later point in time Ti overwrites some object version observed by Tjn, and thus Ti rw-depends on Tjn. Then, however, MVCC-BOT’s write rule (i.e., Point (3(a)ii)) would be violated if the write operation were not rejected. Suppose, alternatively, that Ti had already rw-depended on Tjn when Ti was validated against Tjn. In this case Serializability Condition 1 of Algorithm 6.2 would be violated if Ti were not aborted. That is, the cycle ⟨Ti δ^rw Tj0 δ^wr ... δ^wr Tjn δ^rw Ti⟩ cannot occur in MVSG(MVH) when Tjn commits before Ti.

• Alternatively, suppose that Tjn commits after Ti and that Ti does not rw-depend on Tjn by the time the latter gets to know about Ti’s commit. Suppose further that at some later point in time Tjn observes some object version whose successor version was installed by Ti, and thus Ti rw-depends on Tjn. Then, however, MVCC-BOT’s read rule (i.e., Point (2b)) would be violated if the read operation were not rejected. Suppose, alternatively, that Ti had already rw-depended on Tjn when Tjn was validated against Ti. Given those facts, the condition ST(Tjn) ∩ PT(Tjn) = ∅ at line 6 would be violated if Tjn were not aborted. Thus, the cycle ⟨Ti δ^rw Tj0 δ^wr ... δ^wr Tjn δ^rw Ti⟩ cannot occur in MVSG(MVH) even if Tjn commits after Ti.

II) As another alternative to produce a cycle, suppose that Tjn (either directly or indirectly) rw-depends on Tj0. Then, Ti may wr- and/or rw-depend on Tjn.

i) Suppose Ti wr-depends on Tjn, which implies that Tjn committed before Ti’s starting point and Tjn is an element of PT(Ti). Since the relations bi <MVH cj0 and bj0 <MVH bi hold, Tj0 and Ti are concurrent to each other and, therefore, can commit in arbitrary order.

• Suppose that Ti commits before Tj0 and that Tj0 does not rw-depend on Ti by the time the former gets to know about Ti’s commit. Additionally, suppose that at some later point in time Tj0 overwrites some object version observed by Ti, and thus Tj0 rw-depends on Ti. Then, however, MVCC-BOT’s write rule (i.e., Point (3(a)ii)) would be violated if the write operation were not rejected. Suppose, alternatively, that Tj0 had already rw-depended on Ti when Tj0 was validated against Ti. Then, however, Serializability Condition 1 of Algorithm 6.2 would be violated if Tj0 were not aborted. That is, the cycle ⟨Ti δ^rw Tj0 δ^rw ... δ^rw Tjn δ^wr Ti⟩ cannot occur in MVSG(MVH) if Ti commits before Tj0.

• Suppose, on the contrary, that Ti commits after Tj0 and that Tj0 does not rw-depend on Ti by the time the latter gets to know about Tj0’s commit. Now suppose that at some later point in time Ti observes some object version whose successor version was installed by Tj0, and thus Tj0 rw-depends on Ti. Then, however, MVCC-BOT’s read rule (i.e., Point (2b)) would be violated if the read operation were not rejected. Suppose, on the contrary, that Tj0 had already rw-depended on Ti when the latter was validated against Tj0. This, however, leads to a violation of Algorithm 6.2 since the intersection of ST(Ti) and PT(Ti) returns a non-empty result set and, therefore, Ti cannot commit. Thus, the cycle ⟨Ti δ^rw Tj0 δ^rw ... δ^rw Tjn δ^wr Ti⟩ cannot be created in MVSG(MVH) even if Ti commits after Tj0.

ii) Suppose, as a final means for the cycle to be formed, that Ti rw-depends on Tjn, which implies that the ordering relation bjn <MVH ci holds. Since additionally the relations bi <MVH cj0 and bj0 <MVH cjn hold, Ti, Tj0, and Tjn are executed concurrently and, thus, can commit in arbitrary order. Without loss of generality, assume that the commit order is ci <MVH cj0 <MVH cjn and that Tjn does not rw-depend on Tj0 by the time Tjn gets to know about Tj0’s commit. Now suppose that at some later point in time Tjn overwrites some object version observed by Tj0, and thus Tjn now rw-depends on Tj0. Then, however, MVCC-BOT’s write rule (i.e., Point (3(a)ii)) would be violated if the write operation were not rejected. Suppose, alternatively, that Tjn had already rw-depended on Tj0 when Tjn was validated against Tj0. Then, however, Serializability Condition 1 of Algorithm 6.2 would be violated if Tjn were not aborted. Thus, the cycle ⟨Ti δ^rw Tj0 δ^rw ... δ^rw Tjn δ^rw Ti⟩ cannot occur in MVSG(MVH), and it can be concluded that MVCC-BOT produces only serializable histories.



6.3.2 Optimizing the MVCC-BOT Scheme

To keep the algorithm simple and to focus on its basic principles, we have so far left out optimization techniques that aim at extending the transaction manager’s scheduling flexibility, thereby improving the overall system performance by avoiding false transaction aborts. As a representative example providing a better understanding of the improvement potential, consider the following history produced by a scheduler that ensures BOT data currency guarantees:

Example 7.

MVH6 = bc1 b1 r1[x0] bc2 r1[y0] b2 r2[x0] bc3 r2[z0] bc4 w2[z2] c2 w1[x1] c1 [x0 ≪ x1, z0 ≪ z2]

Note that compared to previous histories MVH6 has been enriched with logical time information. Here each occurrence of bcn denotes the beginning of a new MIBC, where n stands for the non-decreasing identifier of the respective cycle. Note further that the scheduling scenario illustrated in MVH6 would not be allowed by MVCC-BOT since Serializability Condition 1 is violated for transaction T1. However, as MVSG(MVH6) is acyclic (see Figure 6.3) and both transactions meet the BOT data currency criterion, MVH6 should be accepted by the scheduler.

[Figure: multi-version serialization graph over the nodes T0, T1, and T2 with edges labeled wr, ww, and rw]

Figure 6.3: Multi-version serialization graph of MVH6.

If one examines MVH6 more closely, it soon becomes obvious where MVCC-BOT’s validation inefficiency stems from. When validating a read-write transaction Ti, MVCC-BOT assumes that each transaction Tj ∈ Tactive(Ti) has to be serialized after Ti. However, there may exist situations where Tj can be safely serialized before Ti. Informally speaking, this is the case if a transaction Tj ∈ Tactive(Ti) has read from the same or even an earlier database snapshot than Ti and has not written any object such that Tj (either directly or indirectly) rw-depends on Ti. In case Tj ∈ Tactive(Ti) has read some object version xk such that CTS(Tk) ≥ STS(Ti), i.e., Tj has observed a later database snapshot than Ti, then Tj must neither (directly or indirectly) rw-depend on Ti nor (directly or indirectly) wr- or rw-depend on some transaction Tk that itself (directly or indirectly) wr- or rw-depends on Ti (in order to be serialized before Ti).

A protocol that avoids spurious transaction aborts such as the one illustrated in history MVH6 by using a more sophisticated transaction validation algorithm than MVCC-BOT is called MVCC-BOTO; a description of its peculiarities follows. MVCC-BOTO optimizes the MVCC-BOT scheme by checking for each read-write transaction Tj ∈ Tactive(Ti) whether it can be serialized before or after Ti, or whether Ti needs to be aborted because a serialization conflict has been identified. It does so by using Serializability Conditions 1, 2, 3, and 4, with the latter three being specified below:

Serializability Condition 2. Tj must not (either directly or indirectly) rw-depend on Ti, i.e., Tj must not overwrite an object version (either directly or indirectly) read by Ti. 

Serializability Condition 3. Tj must not (either directly or indirectly) wr- or rw-depend on some read-write transaction Tk ∈ ST(Ti), i.e., it must not (either directly or indirectly) observe the effects of Tk or overwrite an object version (either directly or indirectly) observed by Tk.

Serializability Condition 4. Ti must not (either directly or indirectly) wr- or rw-depend on some read-write transaction Tk ∈ ST(Tj), i.e., it must not (either directly or indirectly) observe the effects of Tk or overwrite an object version (either directly or indirectly) observed by Tk. 

Classifying a read-write transaction Tj ∈ Tactive(Ti) as belonging into PT(Ti) or ST(Ti) is han- dled by MVCC-BOTO’s CCR processing algorithm as illustrated below:

1 begin
2   foreach Tj in CCR do
3     if Serializability Conditions 2 and 3 hold then
4       insert Tj into PT(Ti)
5     else
6       if Serializability Conditions 1 and 4 are satisfied then
7         insert Tj into ST(Ti)
8       else
9         abort Ti
10 end

10 end

Algorithm 6.3: CCR processing and transaction validation under MVCC-BOTO.
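Algorithm 6.3's classification step can be transcribed into Python as follows. The sketch assumes the four serializability conditions are available as predicates (`cond1` through `cond4`); the dictionary layout is illustrative.

```python
def process_ccr_boto(ti, ccr, cond1, cond2, cond3, cond4):
    """Classify each committed tj in the CCR (Algorithm 6.3): serialize tj
    before ti (PT), after ti (ST), or abort ti on a serialization conflict.
    Each cond*(ti, tj) tests the respective Serializability Condition.
    Returns True if ti survives, False if it must abort."""
    for tj in ccr:
        if cond2(ti, tj) and cond3(ti, tj):
            ti['PT'].add(tj['tid'])        # tj can be serialized before ti
        elif cond1(ti, tj) and cond4(ti, tj):
            ti['ST'].add(tj['tid'])        # tj must be serialized after ti
        else:
            return False                   # irreconcilable conflict: abort ti
    return True
```

Compared to Algorithm 6.2, a committed transaction is no longer forced into ST(Ti); it is first tested for a place before Ti, which is exactly what rescues histories like MVH6.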

In addition to modifying MVCC-BOT’s CCR processing algorithm, its scheduling algorithm must be adapted to handle the fact that assigning validated transactions to PT(Ti) is only preliminary until Ti has executed its last read operation. By the time a read-write transaction Tj ∈ Tactive(Ti) is assigned to PT(Ti), the scheduler’s decision is based on information about Ti’s current read set. As Ti’s read set may subsequently grow, previous assignments of validated transactions to PT(Ti) might no longer be valid under the changed conditions. Clearly, a read-write transaction Tj needs to be removed from PT(Ti) whenever Serializability Condition 2 is violated. In that case Tj can no longer be serialized before Ti, i.e., the only possible serialization order would be Ti < Tj; however, Serializability Condition 4 then has to hold. Determining whether some read-write transaction Tj ∈ PT(Ti) cannot be serialized before Ti and thus needs to be removed from PT(Ti) is best embedded in MVCC-BOTO’s scheduling algorithm, which is depicted below:

1. Start Rule: See Algorithm 6.1.

2. Read Rule: A read operation ri[x] is processed as follows:

(a) A read operation ri[x] is transformed into ri[xk], where xk is the latest committed version of x that was created by a read-write transaction Tk such that CTS(Tk) < STS(Ti).

(b) To record the information that Tk precedes Ti in any serial history, the scheduler inserts Tk into PT(Ti).

(c) Additionally, a transaction Tj ∈ PT(Ti) that (now) rw-depends on Ti together with any transaction Tk ∈ PT(Ti) that (either directly or indirectly) wr- or rw-depends on Tj is moved into ST(Ti):

i. If xk has been overwritten by Tj, i.e., Serializability Condition 2 is violated, and Ti does not wr- or rw-depend on any transaction Tk ∈ ST(Tj), i.e., Serializability Condition 4 holds.

ii. Otherwise, Tj belongs to CT(Ti) and Ti needs to be aborted.

3. Write Rule: See Algorithm 6.1.

Algorithm 6.4: MVCC-BOTO’s scheduling algorithm.
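The reassignment in Point (2c) of Algorithm 6.4 can be sketched as follows. This is a simplified illustration: the transitive part (also moving transactions that wr- or rw-depend on the reassigned Tj) is elided, and the predicate names are assumptions.

```python
def reassign_on_read(ti, rw_depends_on_ti, cond4):
    """Point (2c) of Algorithm 6.4: after ti performs a new read, every
    tj in PT(ti) that now rw-depends on ti can no longer precede ti and is
    moved into ST(ti), provided Serializability Condition 4 (checked by
    cond4) still holds; otherwise tj belongs to CT(ti) and ti must abort.
    Returns True if ti may continue, False if it must be aborted."""
    for tid in list(ti['PT']):
        if rw_depends_on_ti(tid):
            if not cond4(tid):
                return False               # Point (2(c)ii): abort ti
            ti['PT'].discard(tid)          # Point (2(c)i): move tj ...
            ti['ST'].add(tid)              # ... from PT(ti) into ST(ti)
    return True
```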

Once Ti has issued its last read operation, PT(Ti) will not change any more in the sense that no read-write transaction Tj remains erroneously assigned to PT(Ti) and, therefore, needs to be removed. Between the time of Ti’s pre-commit at the client and the time of its final validation at the server, additional transactions typically commit and, therefore, need to be validated against Ti.

In order to efficiently determine which of those transactions belong to PT(Ti), ST(Ti), and CT(Ti), the server requires information on the contents of ST(Ti) as of Ti’s pre-commit time. Therefore, and analogous to the MVCC-BOT scheme, MVCC-BOTO should piggyback the identifiers of all transactions recorded in ST(Ti) on Ti’s final validation message. It is important to note that clients provide pre-committed transactions’ ST sets not for protocol correctness purposes, but merely to reduce the CPU overhead incurred by their validation at the server. As before, we conclude this subsection by showing that MVCC-BOTO produces only serializable histories that provide BOT data currency guarantees to read-write transactions.

Theorem 9. MVCC-BOTO produces only correct histories in the sense that they are serializable and any object version read by a committed read-write transaction Ti in MVH was up-to-date at the beginning of the MIBC when Ti started its execution.

Proof. In this proof we only need to show that MVCC-BOTO generates serializable histories, since both the MVCC-BOT and MVCC-BOTO schemes apply the same version function (read rule) for mapping data requests to specific object versions and, as has been shown in Part A of the proof of Theorem 8, MVCC-BOT ensures BOT data currency guarantees to read-write transactions. Again let MVH denote any multi-version history produced by the scheme and let MVSG(MVH) be its corresponding multi-version serialization graph. Suppose, by way of contradiction, that MVSG(MVH) contains a cycle ⟨Ti → Tj0 → ... → Tjn → Ti⟩, where Ti and Tjn with n ≥ 0 are read-write transactions that have been executed according to the MVCC-BOTO scheme.

Without loss of generality, possible cycles can be grouped into two classes: (1) those that in- volve only two transactions, and (2) others that contain three or more transactions. Since cycles consisting of only two transactions may not occur in the MVCC-BOTO scheme for similar or even the same reasons as specified in the second paragraph of Part B of the proof given for Theorem 8, we only need to show that cycles consisting of three or more transactions may not be possible either.

(a) To do so, let us initially assume that (in the cycle) Tj0 wr-depends on Ti, which implies that ci precedes bj0 in MVH.

I) Suppose also that Tjn (either directly or indirectly) wr-depends on Tj0, which implies according to MVCC-BOTO's read rule that cj0 precedes bjn. Now suppose that Ti rw- or wr-depends on Tjn, which implies that the ordering relation bjn < ci or cjn < bi holds, respectively. Either relation contradicts the previously derived ordering ci < bjn, since a transaction may be started only once, namely after it has been initiated, and it may commit only once during its lifetime. Thus, the cycles ⟨Ti δ^wr Tj0 δ^wr ... δ^wr Tjn δ^wr Ti⟩ and ⟨Ti δ^wr Tj0 δ^wr ... δ^wr Tjn δ^rw Ti⟩ cannot be formed in MVSG(MVH).

II) Consequently, Tjn may only (either directly or indirectly) rw-depend on Tj0, which implies that bj0 precedes cjn and, thus, the ordering relation ci < cjn holds in MVH. Since ci precedes cjn in MVH, Ti cannot wr-depend on Tjn and, therefore, may only rw-depend on it, which implies that bjn precedes ci. Since the ordering relations bjn < ci < bj0 < cjn hold, Tj0 and Tjn are concurrent to each other and, therefore, can commit in arbitrary order.

• Now suppose that Tj0 commits before Tjn and that Tjn does not rw-depend on Tj0 by the time Tjn gets to know about Tj0's commit. Now suppose at a later point in time Tjn overwrites some object version observed by Tj0, which implies that Tjn now rw-depends on Tj0. Then, however, MVCC-BOTO's write rule (i.e., Point (3(a)ii) of Algorithm 6.1) would be violated if this write operation were not rejected. Suppose otherwise that Tjn had already rw-depended on Tj0 by the time Tjn was validated against Tj0. Then, however, Serializability Condition 1 or 3 of Algorithm 6.3 would be violated if Tjn were not aborted. Thus, the cycle ⟨Ti δ^wr Tj0 δ^rw ... δ^rw Tjn δ^rw Ti⟩ cannot be created in MVSG(MVH) when Tj0 commits before Tjn.

• Suppose, on the contrary, that Tj0 commits after Tjn and that Tjn does not rw-depend on Tj0 by the time Tj0 gets informed about Tjn's commit. Suppose further that at some later point in time Tj0 observes some object version whose successor version was installed by Tjn. Then, however, MVCC-BOTO's read rule (i.e., Point (2(c)i)) would be violated if the read operation were not rejected. Suppose, on the contrary, that Tjn had already rw-depended on Tj0 when Tj0 was validated against Tjn. Then, however, Serializability Condition 2 or 4 of Algorithm 6.3 would be violated if Tj0 were not aborted. Consequently, the cycle ⟨Ti δ^wr Tj0 δ^rw ... δ^rw Tjn δ^rw Ti⟩ cannot be created when Tj0 commits after Tjn.

(b) What remains to be shown is that the cycle cannot be formed when Tj0 rw-depends on Ti. If Tj0 rw-depends on Ti, it follows that bi precedes cj0 in MVH.

I) Now suppose that Tjn (either directly or indirectly) wr-depends on Tj0, which implies that cj0 occurs before bjn in MVH; thus the ordering relation bi < bjn holds.

i) Now suppose that Ti wr-depends on Tjn, which implies that cjn occurs before bi. Since cjn < bi and bi < bjn < cjn cannot hold at the same time, we obtain a contradiction, and the cycle ⟨Ti δ^rw Tj0 δ^wr ... δ^wr Tjn δ^wr Ti⟩ cannot occur in MVSG(MVH).

ii) Therefore, Ti may only rw-depend on Tjn, which implies that the ordering relation bjn < ci holds. Since the ordering relations bi < bjn and bjn < ci hold, Ti and Tjn are concurrent to each other and, therefore, can commit in arbitrary order.

• Initially suppose that Tjn commits before Ti and that Ti does not rw-depend on Tjn by the time Ti gets to know about Tjn's commit. Suppose further that at some later point in time Ti overwrites some object version observed by Tjn. Then, however, MVCC-BOTO's write rule (i.e., Point (3(a)ii) of Algorithm 6.1) would be violated if the write operation were not rejected. Alternatively, suppose that Ti had already rw-depended on Tjn by the time Ti was validated against Tjn. Given those facts, Serializability Condition 1 or 3 of Algorithm 6.3 would be violated if Ti were not aborted. That is, the cycle ⟨Ti δ^rw Tj0 δ^wr ... δ^wr Tjn δ^rw Ti⟩ cannot occur in MVSG(MVH) when Tjn commits before Ti.

• Alternatively suppose that Tjn commits after Ti and that Ti does not rw-depend on Tjn by the time Tjn gets to know about Ti's commit. Suppose further that at some later point in time Tjn observes some object version whose successor version was installed by Ti. Then, however, MVCC-BOTO's read rule (i.e., Point (2(c)i)) would be violated if the read operation were not rejected. Suppose, on the contrary, that Ti had already rw-depended on Tjn when Tjn was validated against Ti. Then, however, Serializability Condition 2 or 4 of Algorithm 6.3 would be violated if Tjn were not aborted. Thus, the cycle ⟨Ti δ^rw Tj0 δ^wr ... δ^wr Tjn δ^rw Ti⟩ cannot occur in MVSG(MVH) even if Tjn commits after Ti.

II) As another alternative to form the cycle, suppose that Tjn (either directly or indirectly) rw-depends on Tj0. Then, Ti may wr- and/or rw-depend on Tjn.

i) Suppose Ti wr-depends on Tjn, which implies that cjn precedes bi and that Tjn is an element of PT(Ti). Because the relations bj0 < cjn < bi and bi < cj0 hold, Ti and Tj0 are concurrent to each other and, therefore, can commit in arbitrary order.

• Suppose that Ti commits before Tj0 and that Tj0 does not rw-depend on Ti by the time Tj0 gets to know about Ti's commit. Additionally, suppose that at some later point in time Tj0 overwrites some object version observed by Ti. Then, however, MVCC-BOTO's write rule (i.e., Point (3(a)ii) of Algorithm 6.1) would be violated if the write operation were not rejected. Suppose, alternatively, that Tj0 had already rw-depended on Ti when Tj0 was validated against Ti. Then, however, Serializability Condition 1 or 3 of Algorithm 6.3 would be violated if Tj0 were not aborted. That is, the cycle ⟨Ti δ^rw Tj0 δ^rw ... δ^rw Tjn δ^wr Ti⟩ cannot occur in MVSG(MVH) if Ti commits before Tj0.

• Suppose, on the contrary, that Ti commits after Tj0 and that Tj0 does not rw-depend on Ti by the time Ti gets to know about Tj0's commit. Now suppose that at some later point in time Ti observes some object version whose successor version was installed by Tj0. Then, however, MVCC-BOTO's read rule (i.e., Point (2(c)i)) would be violated if the read operation were not rejected. Suppose, on the contrary, that Tj0 had already rw-depended on Ti when Ti was validated against Tj0. This, however, leads to a violation of Serializability Condition 2 or 4 of Algorithm 6.3 if Ti were not aborted. Thus, the cycle ⟨Ti δ^rw Tj0 δ^rw ... δ^rw Tjn δ^wr Ti⟩ cannot be created in MVSG(MVH) even if Ti commits after Tj0.

ii) Suppose finally that Ti rw-depends on Tjn, which implies that the ordering relation bjn < ci holds. Since the ordering relations bi < cj0, bj0 < cjn, and bjn < ci hold, the commit order to be considered is ci < cj0 < cjn. Now suppose that Tjn does not rw-depend on Tj0 by the time Tjn gets to know about Tj0's commit. Now suppose that at some later point in time Tjn overwrites some object version observed by Tj0. Then, however, MVCC-BOTO's write rule (i.e., Point (3(a)ii) of Algorithm 6.1) would be violated if the write operation were not rejected.

Suppose, alternatively, that Tjn had already rw-depended on Tj0 when Tjn was validated against Tj0. Then, however, Serializability Condition 1 or 3 of Algorithm 6.3 would be violated if Tjn were not aborted. Thus, the cycle ⟨Ti δ^rw Tj0 δ^rw ... δ^rw Tjn δ^rw Ti⟩ cannot occur in MVSG(MVH), and we can conclude that MVCC-BOTO produces only serializable histories. □



6.3.3 MVCC-IBOT Scheme

By enforcing BOT data currency guarantees, read-write transactions may observe numerous out-of-date data objects depending on the update frequency and update pattern of other concurrently running read-write transactions. While reading from a consistent database snapshot as it existed by a transaction's starting point might be a desirable characteristic for quite a number of applications, it may not be the best choice in terms of overall system performance. The reason is that the performance of transaction-based database systems is heavily impacted by the transaction abort ratio or, in more general terms, by the amount of wasted work done by the system. In systems utilizing optimistic CC schemes, the abort ratio typically correlates with the number of transactions that a validating transaction Ti is validated against. In case of the MVCC-BOT and MVCC-BOTO schemes, the cardinality of ST(Ti) determines (among other things) the probability of Ti being successfully validated by the scheduler. Due to the strictness of the BOT data currency guarantees enforced by both schemes, all or nearly all transactions that committed between Ti's starting and commit point are recorded in ST(Ti) (some transactions might be assigned to PT(Ti) by the MVCC-BOTO scheme). Therefore, the conflict potential of both protocols is expected to be significant.

MVCC-IBOT is a protocol designed to address the performance problem that the MVCC-BOT and MVCC-BOTO schemes are likely to experience due to frequent validation failures. As for the MVCC-BOT and MVCC-BOTO schemes, MVCC-IBOT provides serializability consistency to read-write transactions. However, and with the intention to improve system performance, it slightly changes the way read operations are translated by the scheduler into corresponding object version reads. In contrast to MVCC-BOT and MVCC-BOTO, MVCC-IBOT does not enforce that read-write transactions have to observe a database snapshot as it existed by their starting points.

MVCC-IBOT implements a more timely approach by demanding that any object read by a read-write transaction Ti should be at least as recent as the one that existed by its starting point, i.e., the scheme ensures in-between-of-transaction (IBOT) data currency guarantees. The intuition behind allowing read-write transactions to read “forward” from a database state later than their starting points is to increase the scheme's scheduling power which, in turn, may result in more correct histories being admitted. In this respect, however, it should be noted that increasing the number of versions a scheduler can choose from when mapping read operations to actual version read steps does not automatically result in performance gains; on the contrary, if performed injudiciously, it may even cause performance degradations.

MVCC-IBOT incorporates straightforward, but efficient heuristics when translating read operations into object version reads. The basic idea used by MVCC-IBOT is to force each read-write transaction Ti to read “forward” on object versions created after its starting point until an object version read by Ti is overwritten by some read-write transaction Tj ∈ Tactive(Ti). This gives rise to the following definition:

Definition 35 (Transaction Invalidation). We say that an active read-write transaction Ti gets invalidated by a read-write transaction Tj ∈ Tactive(Ti) if Tj installs the successor version of some object version read by Ti and commits. □
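The invalidation test of Definition 35 can be written down directly. The following sketch is illustrative only: the map-based representation of read sets and overwrites, and the helper name `invalidates`, are assumptions for this example, not data structures from the thesis.

```python
def invalidates(ti_reads, tj_overwrites):
    """Definition 35 (sketch): a committed Tj in Tactive(Ti) invalidates the
    active reader Ti iff Tj installed the successor version of some object
    version Ti has read.

    ti_reads:      object -> timestamp of the version Ti observed
    tj_overwrites: object -> timestamp of the version Tj's write replaced
    """
    return any(tj_overwrites.get(obj) == ts for obj, ts in ti_reads.items())
```

A reader of version x with timestamp 3 is invalidated by a writer that replaced exactly that version, but not by a writer that replaced a different version of x or some object Ti never read.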

When Ti finds out that it has been invalidated during the last MIBC, it stops reading forward and from then on reads “only” those object versions having the largest timestamp < RFSTS(Ti), where RFSTS (which stands for read forward stop timestamp) denotes the commit timestamp of the invalidating transaction Tj. In order to indicate that a read-write transaction Ti is allowed to read “forward” on the current database state, we associate a read forward flag or RFF with Ti. If RFF(Ti) is set to false, it means that Ti has completed its read forward phase (RFP). Determining whether the RFP of an active read-write transaction Ti has ended is carried out by MVCC-IBOT's CCR processing algorithm, which additionally pre-validates Ti:

 1  begin
 2      RFSTS(Ti) ←− ∞;
 3      foreach Tj in CCR do
 4          if RFF is set to true then
 5              if Serializability Condition 2 is violated then
 6                  RFF(Ti) ←− false;
 7                  RFSTS(Ti) ←− CTS(Tj);
 8                  goto line 10
 9          else
10              if Serializability Condition 1 holds then
11                  if Tj rw-depends on Ti then
12                      insert Tj along with ST(Tj) into ST(Ti);
13                      if ST(Ti) ∩ PT(Ti) ≠ ∅ then
14                          abort Ti
15              else
16                  abort Ti
17  end

Algorithm 6.5: CCR processing and transaction validation under MVCC-IBOT.

Note that Algorithm 6.5 assumes that transactions contained in CCR are chronologically ordered according to their commit timestamps and that transactions are processed in that order.
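The control flow of Algorithm 6.5 can be sketched as executable code. This is a minimal sketch under stated assumptions: Serializability Condition 2 is modeled as "Tj wrote nothing Ti read" and Condition 1 as "Ti wrote nothing Tj read" (read/write-set intersections); the `Txn` record and the function name are illustrative, and the `goto` is rendered as fall-through into the validation branch.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Txn:
    name: str
    read_set: set = field(default_factory=set)
    write_set: set = field(default_factory=set)
    cts: int = 0                          # commit timestamp (known for CCR entries)
    rff: bool = True                      # read forward flag RFF(Ti)
    rfsts: float = math.inf               # read forward stop timestamp RFSTS(Ti)
    ST: set = field(default_factory=set)  # transactions serialized after Ti
    PT: set = field(default_factory=set)  # transactions serialized before Ti
    aborted: bool = False

def process_ccr(ti, ccr):
    """Sketch of Algorithm 6.5 for a validating transaction ti."""
    ti.rfsts = math.inf
    for tj in sorted(ccr, key=lambda t: t.cts):      # chronological commit order
        if ti.rff and (ti.read_set & tj.write_set):  # assumed Condition 2 violated:
            ti.rff = False                           # Tj overwrote something Ti read,
            ti.rfsts = tj.cts                        # so the read-forward phase ends
        if not ti.rff:                               # validate Tj ("goto line 10")
            if ti.write_set & tj.read_set:           # assumed Condition 1 violated
                ti.aborted = True
                return
            if tj.write_set & ti.read_set:           # Tj rw-depends on Ti
                ti.ST |= {tj.name} | tj.ST
                if ti.ST & ti.PT:                    # Tj needed on both sides
                    ti.aborted = True
                    return
```

Note how a transaction is never aborted while `rff` is still true, matching the observation below that ST(Ti) stays empty until the read-forward phase has ended.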

Assigning object versions to read operations is handled by MVCC-IBOT's scheduler in the following manner:

1. Read Rule: A read operation ri[x] is processed as follows:

(a) If RFF is set to true, ri[x] is translated into ri[xk], where xk is the most recent version of x received by the client.

(b) Otherwise, i.e., if RFF is set to false, ri[x] is mapped into ri[xk], where xk is the most recent version of x that was created by a read-write transaction Tk such that CTS(Tk) < RFSTS(Ti).

(c) If there exists a write operation wj[xj] in MVH such that Tj ∈ Tactive(Ti) and Tj rw-depends on Ti, i.e., it has overwritten the object version xk observed by Ti (xk ≪ xj), then the read operation is rejected and Ti is aborted.

(d) To record the information that Tk precedes Ti in any serial order of the transactions in MVH, the scheduler inserts Tk into PT(Ti).

2. Write Rule: See Algorithm 6.1.

Algorithm 6.6: MVCC-IBOT’s scheduling algorithm.
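The version selection of Points (1a) and (1b) amounts to a one-line choice per case. The sketch below makes assumptions for illustration: object versions are kept as `(CTS, value)` pairs in ascending commit-timestamp order, and the names `select_version` and the sample timestamps are hypothetical.

```python
import math
from types import SimpleNamespace

def select_version(ti, versions):
    """Sketch of MVCC-IBOT's read rule, Points (1a)/(1b). `versions` lists
    the client-cached (CTS, value) pairs of one object in ascending
    commit-timestamp order; `ti` carries the RFF flag and RFSTS."""
    if ti.rff:
        return versions[-1]                              # (1a) newest cached version
    return [v for v in versions if v[0] < ti.rfsts][-1]  # (1b) newest before RFSTS

# Illustration with hypothetical timestamps:
x_versions = [(0, "x0"), (5, "x5"), (9, "x9")]
reading_forward = SimpleNamespace(rff=True, rfsts=math.inf)
stopped = SimpleNamespace(rff=False, rfsts=6)  # invalidated by a txn with CTS 6
```

While `rff` holds, the transaction tracks the newest broadcast state; after invalidation it is pinned to the snapshot just before the invalidating commit.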

MVCC-IBOT's scheduling algorithm differs from the one used by the MVCC-BOT and MVCC-BOTO schemes w.r.t. the following three facts: (a) Most obviously, it lacks a start rule, since recording transactions' start timestamps is no longer required for CC; (b) the way MVCC-IBOT's scheduler maps transactions' read operations to corresponding object version reads is determined by the actual conflict situation in the system rather than by an a priori specified version function; and (c) most importantly, transactions being validated cannot be aborted until RFF is set to false, since Serializability Condition 1 needs to be checked by MVCC-IBOT's scheduler only if ST(Ti) is non-empty. As there will not be any transaction in ST(Ti) until RFF has been changed to false, Ti will never be aborted before this point.

As for the MVCC-BOT scheme, after a read-write transaction Ti has successfully executed its last data operation, Ti pre-commits and the local transaction manager initiates Ti's final validation by sending an FVM to the server. MVCC-IBOT's FVMs have the same contents as those transferred to the server by the MVCC-BOT scheme with the difference that the former additionally contain the value of Ti's RFF parameter. Upon arrival of FVM(Ti) at the server, Ti is validated by applying Algorithm 6.5 to all read-write transactions Tj with CTS(Tj) > TSc(CCRlast). If the validation check succeeds, Ti's effects become visible to other transactions and a commit notification is sent to the client. Otherwise, Ti aborts and a respective abort message is delivered.

So far, we have not provided proof that MVCC-IBOT operates correctly in the sense that it produces only serializable histories and it fulfills the IBOT data currency criterion. In what follows, we show that MVCC-IBOT meets its promised guarantees by proving the following theorem:

Theorem 10. Every history generated by MVCC-IBOT is serializable and each read operation of any committed read-write transaction Ti is done from a database state as it existed somewhere between Ti’s starting and its commit point (including them).

Proof. The proof consists of two parts. First, we show that MVCC-IBOT's read rule ensures IBOT data currency. Second, we prove that MVCC-IBOT produces only serializable histories.

Part A: Let Ti denote a read-write transaction with RFSTS(Ti) being the logical time when an object version xk previously read by Ti was first overwritten by some read-write transaction Tj ∈ Tactive(Ti), i.e., xk ≪ xj, DSRFSTS(Ti) being the database state as it existed at RFSTS(Ti), and DSEOT(Ti) representing the database state as it exists at Ti's commit point. We will now show that all data objects read by Ti belong either to DSRFSTS(Ti) or to DSEOT(Ti) and that MVCC-IBOT therefore ensures IBOT data currency. Which object version MVCC-IBOT's scheduler selects when mapping read operations to actual object version reads depends on the state of RFF(Ti). As long as RFF(Ti) is set to true, MVCC-IBOT's read rule maps Ti's reads to the most recent object versions available; once a read-write transaction Tj ∈ Tactive(Ti) invalidates Ti, the read rule ensures that Ti sees only the database state that was up-to-date before the updates of Tj were incorporated into the database. If RFF(Ti) is still set to true at Ti's commit point, Ti has read from DSEOT(Ti). Otherwise, Ti has seen DSRFSTS(Ti). Object versions read before Ti's invalidation had not been updated prior to RFSTS(Ti) and, thus, belong to DSRFSTS(Ti). Since DSEOT(Ti) and DSRFSTS(Ti) are at least as recent as DSSTS(Ti) and DSEOT(Ti) is, by definition, identical to the database state at EOT(Ti), MVCC-IBOT provides IBOT data currency guarantees.

Part B: Again, let MVH denote a serializable multi-version history with MVSG(MVH) being its multi-version serialization graph. Because MVH is serializable, it follows that MVSG(MVH) is acyclic. Now suppose, by way of contradiction, that MVSG(MVH) contains a cycle ⟨Ti → Tj0 → ... → Tjn → Ti⟩, where Ti and Tjn with n ≥ 0 denote read-write transactions that have been executed by the MVCC-IBOT scheme.

(1) Cycle consists of two transactions only: Let us initially assume that the cycle involves only two transactions, namely Ti and Tj0, and is formed due to either of the following dependencies between them: (a) Ti and Tj0 wr-depend on each other, (b) Ti and Tj0 rw-depend on each other, or (c) Ti wr-depends on Tj0 and Tj0 rw-depends on Ti or vice versa.

(a) Suppose that Ti wr-depends on Tj0. Because MVCC-IBOT ensures that read-write transactions observe committed data only, it follows that the ordering relation cj0 < ci holds. Suppose further that Tj0 wr-depends on Ti. Then, it follows that the ordering relation ci < cj0 holds as well, obviously leading to a contradiction. Thus, MVSG(MVH) cannot contain the cycle ⟨Ti δ^wr Tj0 δ^wr Ti⟩ if the MVCC-IBOT scheme is used.

(b) Suppose that Ti rw-depends on Tj0, ci precedes cj0 in MVH, and Tj0 does not rw-depend on Ti by the time Tj0 is informed about Ti's commit. Then, it follows that Ti is a member of ST(Tj0) according to Algorithm 6.5. Now suppose at a later stage Tj0 overwrites some object version observed by Ti, which implies that Tj0 now rw-depends on Ti. Then, however, MVCC-IBOT's write rule (i.e., Point (3(a)i) of Algorithm 6.1) would be violated if this operation were allowed to occur. Suppose otherwise that Tj0 had already rw-depended on Ti by the time Tj0 was validated against Ti. Then, Serializability Condition 2 or the condition specified at line 13 of Algorithm 6.5 would be violated if Tj0 were not aborted.

Alternatively, suppose that Ti rw-depends on Tj0, cj0 precedes ci in MVH, and Tj0 does not rw-depend on Ti by the time Ti gets to know about Tj0's commit. Now suppose at a later stage Ti observes some object version whose successor version was installed by Tj0; thus Tj0 rw-depends on Ti. Then, however, MVCC-IBOT's read rule (i.e., Point (1b)) would be violated if this read operation were not rejected. Now suppose that Tj0 had already rw-depended on Ti by the time Ti was validated against Tj0. Then, Serializability Condition 1 or the condition specified at line 13 of Algorithm 6.5 would be violated if Ti were not aborted. Thus, the cycle ⟨Ti δ^rw Tj0 δ^rw Ti⟩ cannot be produced under the MVCC-IBOT scheme.

(c) Suppose Ti wr-depends on Tj0. Then, it follows that cj0 precedes ci in MVH and RFSTS(Ti) > CTS(Tj0). Now suppose further that Tj0 rw-depends on Ti, which implies that bi precedes cj0 and CTS(Tj0) ≥ RFSTS(Ti), obviously leading to a contradiction. Using the same line of reasoning, we can actually show that the dependencies "Ti rw-depends on Tj0" and "Tj0 wr-depends on Ti" cannot co-exist without the MVCC-IBOT scheme being violated. Thus, the cycles ⟨Ti δ^rw Tj0 δ^wr Ti⟩ and ⟨Ti δ^wr Tj0 δ^rw Ti⟩ cannot occur in MVSG(MVH) and, therefore, it may not contain a cycle involving exactly two transaction vertices.

(2) Cycle consists of three or more transactions: Now suppose that the cycle involves three or even more transactions. Irrespective of the size of the cycle, it must have an edge Ti → Tj0. This edge occurs when Tj0 wr- or rw-depends on Ti.

(a) Let us initially assume that Tj0 wr-depends on Ti, which implies that ci precedes cj0 in MVH, RFSTS(Tj0) > CTS(Ti), and Ti belongs to PT(Tj0) according to MVCC-IBOT's read rule (i.e., Point (1d)).

I) Suppose also that Tjn (either directly or indirectly) wr-depends on Tj0, which implies that cj0 precedes cjn in MVH, RFSTS(Tjn) > CTS(Tj0), and Tj0 ∈ PT(Tjn).

i) Now suppose that Ti wr-depends on Tjn, which implies that cjn precedes ci in MVH, RFSTS(Ti) > CTS(Tjn), and Tjn ∈ PT(Ti), leading to a contradiction since ci cannot precede and succeed cjn at the same time according to Point (3) of Definition 9. Thus, the cycle ⟨Ti δ^wr Tj0 δ^wr ... δ^wr Tjn δ^wr Ti⟩ cannot be produced in MVSG(MVH).

ii) On the contrary, suppose that Ti rw-depends on Tjn, which implies that bjn precedes ci and CTS(Ti) ≥ RFSTS(Tjn). Since the conditions CTS(Ti) ≥ RFSTS(Tjn) and RFSTS(Tjn) > CTS(Tj0) hold, it follows that CTS(Ti) > CTS(Tj0) holds too, which, in turn, implies the ordering relation cj0 < ci. This, however, contradicts the previously derived ordering relation ci < cj0. Thus, the cycle ⟨Ti δ^wr Tj0 δ^wr ... δ^wr Tjn δ^rw Ti⟩ cannot occur in MVSG(MVH).

II) Consequently, the cycle may only be produced when Tjn (either directly or indirectly) rw-depends on Tj0, which implies that bj0 precedes cjn and CTS(Tjn) ≥ RFSTS(Tj0).

i) Suppose further that Ti wr-depends on Tjn, which implies that cjn precedes ci in MVH, RFSTS(Ti) > CTS(Tjn), and Tjn ∈ PT(Ti). Since the conditions CTS(Tjn) ≥ RFSTS(Tj0) and RFSTS(Tj0) > CTS(Ti) hold, it follows that the condition CTS(Tjn) > CTS(Ti) holds too, which, in turn, implies the ordering relation ci < cjn. This, however, contradicts the ordering relation cjn < ci derived above; thus the cycle ⟨Ti δ^wr Tj0 δ^rw ... δ^rw Tjn δ^wr Ti⟩ cannot occur in MVSG(MVH).

ii) To conclude this series of possible cycles, suppose that Ti rw-depends on Tjn, which implies that bjn precedes ci and CTS(Ti) ≥ RFSTS(Tjn). Since the ordering relations bj0 < cjn and bjn < ci < cj0 hold, Tj0 and Tjn are concurrent to each other and, therefore, can commit in arbitrary order.

• Suppose that Tj0 commits before Tjn and that Tjn does not rw-depend on Tj0 by the time Tjn gets to know about Tj0's commit. Suppose further that at some later point in time Tjn overwrites some object version observed by Tj0. Then, however, MVCC-IBOT's write rule (i.e., Point (3(a)ii) of Algorithm 6.1) would be violated if the write operation were not rejected. Suppose, alternatively, that Tjn had already rw-depended on Tj0 when Tjn was validated against Tj0. In this situation Serializability Condition 1 or 2 of Algorithm 6.5 would be violated if Tjn were not aborted. Thus, the cycle ⟨Ti δ^wr Tj0 δ^rw ... δ^rw Tjn δ^rw Ti⟩ cannot be created in MVSG(MVH) when Tj0 commits before Tjn.

• Suppose, on the contrary, that Tj0 commits after Tjn and that Tjn does not rw-depend on Tj0 by the time Tj0 gets to know about Tjn's commit. Suppose further that at some later point in time Tj0 observes some object version whose successor version was installed by Tjn; thus Tjn rw-depends on Tj0. Then, however, MVCC-IBOT's read rule (i.e., Point (1c)) would be violated if the read operation were not rejected. Suppose, on the contrary, that Tjn had already rw-depended on Tj0 when Tj0 was validated against Tjn. Then, however, Serializability Condition 2 or the condition specified at line 13 of Algorithm 6.5 would be violated if Tj0 were not aborted. Thus, the cycle ⟨Ti δ^wr Tj0 δ^rw ... δ^rw Tjn δ^rw Ti⟩ cannot be created when Tj0 commits after Tjn.

(b) Alternatively, the cycle can be produced when Tj0 rw-depends on Ti, which implies that bi precedes cj0 in MVH and CTS(Tj0) ≥ RFSTS(Ti).

I) Now suppose that Tjn (either directly or indirectly) wr-depends on Tj0, which implies that cj0 precedes cjn in MVH, RFSTS(Tjn) > CTS(Tj0), and Tj0 ∈ PT(Tjn).

i) Suppose further that Ti wr-depends on Tjn, which implies that cjn precedes ci in MVH, RFSTS(Ti) > CTS(Tjn), and Tjn ∈ PT(Ti). Since the conditions RFSTS(Ti) > CTS(Tjn) and CTS(Tj0) ≥ RFSTS(Ti) hold, it follows that CTS(Tj0) > CTS(Tjn) holds too. This, however, contradicts the previously derived ordering relation cj0 < cjn, which must hold if the MVCC-IBOT scheme is used. As a result, the cycle ⟨Ti δ^rw Tj0 δ^wr ... δ^wr Tjn δ^wr Ti⟩ cannot occur in MVSG(MVH) when Ti wr-depends on Tjn.

ii) On the contrary, suppose that Ti rw-depends on Tjn, which implies that bjn precedes ci in MVH and CTS(Ti) ≥ RFSTS(Tjn). Since the ordering relations bjn < ci and bi < cj0 < cjn hold, Ti and Tjn are concurrent to each other and, therefore, can commit in arbitrary order.

• Suppose that Ti commits before Tjn and that Ti does not rw-depend on Tjn by the time Tjn gets to know about Ti's commit. Suppose further that at some later point in time Tjn observes some object version whose successor version was installed by Ti; thus Ti now rw-depends on Tjn. Then, however, MVCC-IBOT's read rule (i.e., Point (1c)) would be violated if the read operation were not rejected. Suppose, alternatively, that Ti had already rw-depended on Tjn when Tjn was validated against Ti. Then, however, Serializability Condition 2 or the condition specified at line 13 of Algorithm 6.5 would be violated if Tjn were not aborted. That is, the cycle ⟨Ti δ^rw Tj0 δ^wr ... δ^wr Tjn δ^rw Ti⟩ cannot occur in MVSG(MVH) when Ti commits before Tjn.

• Suppose, on the contrary, that Ti commits after Tjn and that Ti does not rw-depend on Tjn by the time Ti gets to know about Tjn's commit. Suppose further that at some later stage Ti overwrites some object version observed by Tjn; thus Ti now rw-depends on Tjn. Then, however, MVCC-IBOT's write rule (i.e., Point (3(a)ii) of Algorithm 6.1) would be violated if the write operation were not rejected. Suppose, alternatively, that Ti had already rw-depended on Tjn when Ti was validated against Tjn. Then, however, Serializability Condition 1 or 2 of Algorithm 6.5 would be violated if Ti were not aborted. Thus, the cycle ⟨Ti δ^rw Tj0 δ^wr ... δ^wr Tjn δ^rw Ti⟩ cannot occur in MVSG(MVH) despite the fact that Ti commits after Tjn.

II) Now suppose that Tjn (either directly or indirectly) rw-depends on Tj0, which implies that bj0 precedes cjn and CTS(Tjn) ≥ RFSTS(Tj0).

i) Suppose also that Ti wr-depends on Tjn, which implies that cjn precedes ci in MVH, RFSTS(Ti) > CTS(Tjn), and Tjn ∈ PT(Ti). Since the ordering relations bj0 < cjn < ci and bi < cj0 hold, Ti and Tj0 are concurrent to each other and, therefore, can commit in arbitrary order.

• Suppose that Ti commits before Tj0 and that Tj0 does not rw-depend on Ti by the time Tj0 gets to know about Ti's commit. Suppose further at some later point in time Tj0 overwrites some object version observed by Ti; thus Tj0 now rw-depends on Ti. Then, however, MVCC-IBOT's write rule (i.e., Point (3(a)ii) of Algorithm 6.1) would be violated if the write operation were not rejected. Suppose, alternatively, that Tj0 had already rw-depended on Ti when Tj0 was validated against Ti. In this particular situation, Serializability Condition 1 or 2 of Algorithm 6.5 would be violated if Tj0 were not aborted. That is, the cycle ⟨Ti δ^rw Tj0 δ^rw ... δ^rw Tjn δ^wr Ti⟩ cannot occur in MVSG(MVH) if Ti commits before Tj0.

• Suppose, on the contrary, that Ti commits after Tj0 and that Tj0 does not rw-depend on Ti by the time Ti gets to know about Tj0's commit. Suppose further at some later point in time Ti observes some object version whose successor version was installed by Tj0; thus Tj0 now rw-depends on Ti. Then, however, MVCC-IBOT's read rule (i.e., Point (1c)) would be violated if the read operation were not rejected. Suppose, alternatively, that Tj0 had already rw-depended on Ti when Ti was validated against Tj0. Then, however, Serializability Condition 2 or the condition specified at line 13 of Algorithm 6.5 would be violated if Ti were not aborted. Thus, the cycle ⟨Ti δ^rw Tj0 δ^rw ... δ^rw Tjn δ^wr Ti⟩ cannot be created in MVSG(MVH) even if Ti commits after Tj0.

ii) Finally suppose that Ti rw-depends on Tjn, which implies that bjn precedes ci and CTS(Ti) ≥ RFSTS(Tjn). Since the ordering relations bi < cj0, bj0 < cjn, and bjn < ci hold, the commit order to be considered is ci < cj0 < cjn. Suppose that Tjn does not rw-depend on Tj0 by the time Tjn gets to know about Tj0's commit. Suppose further at some later point in time Tjn overwrites some object version observed by Tj0; thus Tjn now rw-depends on Tj0. Then, however, MVCC-IBOT's write rule (i.e., Point (3(a)ii) of Algorithm 6.1) would be violated if the write operation were not rejected. Suppose, alternatively, that Tjn had already rw-depended on Tj0 when Tjn was validated against Tj0. Then, again Serializability Condition 1 or 2 of Algorithm 6.5 would be violated if Tjn were not aborted. Thus, the cycle ⟨Ti δ^rw Tj0 δ^rw ... δ^rw Tjn δ^rw Ti⟩ cannot occur in MVSG(MVH). Consequently, MVSG(MVH) is acyclic and, therefore, MVCC-IBOT produces only correct histories in the sense that they are serializable. □



We conclude this subsection by investigating the relationship between MVCC-BOT and MVCC-IBOT. MVCC-IBOT appears to be less restrictive than MVCC-BOT (a protocol P1 is said to be more restrictive than another protocol P2 if P1 permits fewer histories than P2), leading us to the following theorem:

Theorem 11. MVCC-IBOT’s consistency and currency definitions are less restrictive than those of the MVCC-BOT scheme.

Proof. We first show that there exist histories that are allowed by MVCC-IBOT, but are disallowed by MVCC-BOT. For this purpose we use history MVH7 illustrated below:

Example 8.

MVH7 = bc1 b1 r1[x0] bc2 b2 r2[y0] bc3 r2[z0] w2[z2] c2 bc4 r1[z2] w1[x1] c1 [x0 ≪ x1, z0 ≪ z2]



MVH7 is disallowed by MVCC-BOT as T1 observed object version z2, which had been installed by T2 after T1's starting point (it can easily be seen that STS(T1) ≤ CTS(T2)). Unlike the MVCC-BOT scheme, MVCC-IBOT's scheduler allows T1 to read “forward” on T2 and thus to see its effects, as the latter had not overwritten any object version observed by T1 and, therefore, Serializability Condition 2 is not violated. Further, as the operations in MVH7 do not violate MVCC-IBOT's read and write rules, MVH7 is allowed by MVCC-IBOT.

It remains to prove that all histories allowed by MVCC-BOT are allowed by MVCC-IBOT as well. This proof is straightforward as both protocols enforce (among other things) that a validating transaction Ti may not violate Serializability Condition 1 in order to be committed. In both protocols the validity of this condition can easily be verified by intersecting Ti's write set with the read sets of all transactions in ST(Ti). Under the MVCC-BOT scheme, this condition must hold for all concurrently active transactions that committed during Ti's execution time, i.e., |ST(Ti)| = |Tactive(Ti)|. Under the MVCC-IBOT scheme, however, Serializability Condition 1 needs to hold only for those transactions in Tactive(Ti) that committed after RFF(Ti) had been set to false, i.e., |ST(Ti)| ≤ |Tactive(Ti)|. Therefore, transactions executed under the MVCC-IBOT protocol are more likely to pass their validation checks than those run under the MVCC-BOT scheme and, hence, MVCC-IBOT permits more correct histories. □

6.3.4 Optimizing the MVCC-IBOT Scheme

The MVCC-IBOT algorithm as previously presented does not yet include any optimization routines. Similarly to MVCC-BOT, MVCC-IBOT's performance may suffer from spurious data conflicts between a validating transaction Ti and already validated transactions that are recorded in ST(Ti). Such erroneously identified conflicts occur if an already validated and committed transaction Tj that can be serialized before Ti is erroneously assigned to ST(Ti) in lieu of PT(Ti). The following example illustrates such a situation:

Example 9.

MVH8 = bc1 b1 r1[z0] b2 r2[y0] bc2 w1[z1] b3 b4 r3[y0] c1 r4[z1] r4[x0] bc3 w4[z4] c4 w2[y2] r3[x0] c2 bc4 w3[x3] c3 [x0 ≪ x3, y0 ≪ y2, z0 ≪ z1 ≪ z4]

In MVH8 each read-write transaction is executed with serializability correctness and IBOT data currency guarantees. However, MVH8 would not be produced by the MVCC-IBOT scheme despite the fact that MVSG(MVH8) is acyclic and each read operation in MVH8 satisfies MVCC-IBOT's data currency criterion. The MVCC-IBOT scheduler would actually abort T3 since its CCR processing and transaction validation algorithm, i.e., Algorithm 6.5, (erroneously) assigns T4 to ST(T3) instead of PT(T3), and T3's write operation w3[x3] conflicts with T4's read operation r4[x0]. But, as MVSG(MVH8) in Figure 6.4 shows, T4 can actually be serialized before T3 and should therefore belong to PT(T3). □

Figure 6.4: Multi-version serializability graph of MVH8.
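The serializability of MVH8 can be checked mechanically. The following Python sketch encodes the dependency edges of MVSG(MVH8) as we read them off the history and version order of Example 9 (the edge list is our own derivation, not taken from the figure) and verifies via a topological sort that the graph is acyclic and that T4 can be ordered before T3.

```python
from collections import defaultdict, deque

# Dependency edges of MVSG(MVH8): (k, m) means Tk must be serialized
# before Tm.  wr/ww edges stem from reads and overwrites of committed
# versions, rw edges from reads of later-overwritten versions.
edges = [
    (0, 1), (0, 2), (0, 3), (0, 4),  # wr,ww edges from T0
    (1, 4),                          # T4 reads z1 written by T1
    (4, 3),                          # rw: T4 reads x0, T3 installs x3
    (3, 2),                          # rw: T3 reads y0, T2 installs y2
]

def topological_order(edges, nodes):
    """Kahn's algorithm; returns a serialization order, or None if
    the graph contains a cycle."""
    succ, indeg = defaultdict(list), {n: 0 for n in nodes}
    for a, b in edges:
        succ[a].append(b)
        indeg[b] += 1
    queue = deque(n for n in nodes if indeg[n] == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    return order if len(order) == len(nodes) else None

order = topological_order(edges, range(5))  # [0, 1, 4, 3, 2]
```

Since the graph is acyclic, `order` is a valid serialization order, and T4 indeed precedes T3 in it, confirming that T4 belongs in PT(T3) rather than ST(T3).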

As the previous example has shown, MVCC-IBOT falls short of recognizing whether a validated transaction Tj ∈ Tactive(Ti) with CTS(Tj) > RFSTS(Ti) can be serialized before Ti and, therefore, should be assigned to PT(Ti). A protocol that eliminates MVCC-IBOT's problem is called MVCC-IBOTO and will be described in the remainder of this subsection. Like MVCC-BOTO, MVCC-IBOTO exploits the CC information periodically delivered to clients to identify recently committed transactions that can be safely serialized before Ti. MVCC-IBOTO assigns a validated transaction Tj ∈ Tactive(Ti) to PT(Ti) if RFF is set to false and, additionally, Serializability Conditions 2 and 3 are satisfied.

The algorithm used by MVCC-IBOTO for transaction classification and validation is depicted below:

1  begin
2    RFSTS(Ti) ← ∞;
3    foreach Tj in CCR do
4      if RFF is set to true then
5        if Serializability Condition 2 is violated then
6          RFF(Ti) ← false;
7          RFSTS(Ti) ← CTS(Tj);
8          goto line 10
9      else
10       if Serializability Conditions 2 and 3 hold then
11         insert Tj into PT(Ti)
12       else
13         if Serializability Conditions 1 and 4 are satisfied then
14           insert Tj into ST(Ti)
15         else
16           abort Ti
17 end

Algorithm 6.7: CCR processing and transaction validation under MVCC-IBOTO.
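A compact Python rendering of the classification loop above may help; the condition predicates and the transaction-state fields are placeholders assumed for this sketch, since the Serializability Conditions themselves are defined earlier in the chapter.

```python
import math

def process_ccr(ti, ccr, cond):
    """Sketch of MVCC-IBOTO's CCR processing (Algorithm 6.7).
    ti:   mutable state of the validating transaction, assumed to
          carry the fields rff, rfsts, pt, st, aborted.
    ccr:  iterable of recently committed transactions Tj (with
          fields id and cts).
    cond: cond[k](ti, tj) returns True iff Serializability
          Condition k holds."""
    ti.rfsts = math.inf
    for tj in ccr:
        if ti.rff:
            if cond[2](ti, tj):
                continue          # Tj needs no classification yet
            ti.rff = False        # Condition 2 violated: end Ti's RFP
            ti.rfsts = tj.cts     # then classify Tj ("goto line 10")
        if cond[2](ti, tj) and cond[3](ti, tj):
            ti.pt.add(tj.id)      # Tj can be serialized before Ti
        elif cond[1](ti, tj) and cond[4](ti, tj):
            ti.st.add(tj.id)      # Tj must be serialized after Ti
        else:
            ti.aborted = True     # Ti cannot be serialized correctly
            return
```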

It is important to note that MVCC-IBOTO does not store any validated transaction Tj ∈ Tactive(Ti) with CTS(Tj) < RFSTS(Ti) in PT(Ti) since those transactions can be safely serialized before Ti and, therefore, need not be recorded in PT(Ti). However, those read-write transactions that terminated with commit timestamps > RFSTS(Ti) need to be classified, and their identifiers along with their read and write sets must be recorded either in PT(Ti) or ST(Ti). Like the MVCC-BOTO scheme, MVCC-IBOTO groups recently committed transactions into either of the two sets once their successful termination is broadcast by means of a CCR. At this point, Ti typically has not yet issued its last read operation and, therefore, classification decisions are based on incomplete knowledge of Ti's final read set. To address the problem, we extend MVCC-IBOT's scheduling algorithm by an additional validation routine integrated into the scheduler's read rule which ensures that previously made transaction classifications are re-examined after Ti has processed a further read operation. Remember that this approach is exercised by the MVCC-BOTO scheme as well. The complete scheduling algorithm of the MVCC-IBOTO scheme is illustrated below:

1. Read Rule: A read operation ri[x] is processed as follows:

(a) If RFF is set to true, a read operation ri[x] is translated into ri[xk], where xk is the most recent version of object x received by the client.

(b) Otherwise, i.e., if RFF is set to false, a read operation ri[x] is mapped into ri[xk], where xk is the most recent object version created by some transaction Tk such that CTS(Tk) < RFSTS(Ti).

(c) Also, if RFF is set to false, Read Rule 2c of Algorithm 6.4 is enforced.

(d) To record the information that Tk precedes Ti in any serial history, the scheduler inserts Tk into PT(Ti).

2. Write Rule: See Algorithm 6.1.

Algorithm 6.8: MVCC-IBOTO’s scheduling algorithm.
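The version selection performed by the read rule can be sketched as follows; the representation of the received versions as (CTS, value) pairs is an assumption of this illustration.

```python
def select_version(versions, rff, rfsts):
    """Sketch of the version selection in MVCC-IBOTO's read rule
    (Algorithm 6.8) for a single object.
    versions: list of (cts, value) pairs received by the client,
              in arbitrary order.
    If RFF is true, the most recent received version is chosen;
    otherwise only versions with CTS < RFSTS(Ti) qualify."""
    if rff:
        candidates = versions
    else:
        candidates = [v for v in versions if v[0] < rfsts]
    if not candidates:
        return None  # requested version no longer available
    return max(candidates, key=lambda v: v[0])
```

The transaction that created the chosen version would then be inserted into PT(Ti), as Rule 1d prescribes.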

As for the MVCC-BOTO scheme, the server needs Ti's ST information in order to efficiently perform its final transaction validation. It can be obtained in two ways: (a) either the client sends that information to the server, or (b) the server itself computes the set of transactions that (either directly or indirectly) rw-depend on Ti and, therefore, cannot be serialized before it. The availability of these two options allows clients to trade off network bandwidth for additional CPU overheads at the server and vice versa. Since we expect ST(Ti) to be relatively small in size, whereas its associated computational costs are expected to be fairly large, the recommended default is to send this data piggybacked on FVM(Ti) to the server. Again, we conclude this subsection by showing that MVCC-IBOTO produces only correct histories, as stated in Theorem 12 below:

Theorem 12. MVCC-IBOTO produces only correct multi-version histories in the sense that they are serializable and any committed read-write transaction Ti in MVH has observed a consistent snapshot of the database as it existed somewhere between Ti's starting and commit point (including them).

Proof. The following proof is again split into two parts. We first show that MVCC-IBOTO ensures IBOT data currency for read-write transactions and thereafter, we provide evidence that it produces only serializable histories.

Part A: Similarly to the proof of Theorem 10, we start by proving that MVCC-IBOTO enforces IBOT data currency for any committed read-write transaction Ti. Which object versions Ti actually reads during its execution is determined by MVCC-IBOTO's read rule. In the proof of Theorem 10, we have shown that MVCC-IBOT's read rule provides IBOT data currency to read-write transactions. Since MVCC-IBOTO's read rule contains only a slight modification of MVCC-IBOT's rule, it is sufficient to show that the modified part does not violate the IBOT data currency criterion. Compared to the MVCC-IBOT scheme, MVCC-IBOTO's read rule contains a modified version of Statement (1c). Since the operations enforced by Statement (1c) do not influence the way MVCC-IBOTO maps read operations to object version reads, but are rather concerned with checking an active transaction's serializability, it follows that MVCC-IBOTO ensures IBOT data currency guarantees too.

Part B: We will now show that MVCC-IBOTO produces only serializable histories. Let MVH denote any multi-version history produced under the MVCC-IBOTO scheme with MVSG(MVH) being its multi-version serialization graph. To show that MVSG(MVH) is acyclic, suppose, by way of contradiction, that MVSG(MVH) contains a cycle ⟨Ti → Tj0 → ... → Tjn → Ti⟩, where Ti and Tj0, ..., Tjn with n ≥ 0 denote read-write transactions that have been executed under the MVCC-IBOTO scheme.

(1) Cycle consists of two transactions only: As in previous proofs, we start by assuming that the cycle involves only two transactions, namely Ti and Tj0, i.e., n = 0. Then, to form the cycle, one of the following conflict dependencies between Ti and Tj0 must hold: (a) Ti and Tj0 wr-depend on each other, (b) Ti and Tj0 rw-depend on each other, or (c) Ti wr-depends on Tj0 and Tj0 rw-depends on Ti or vice versa.

(a) Suppose that Ti wr-depends on Tj0. Because MVCC-IBOTO ensures that read-write transactions observe committed data only, it follows that the ordering relation cj0 < ri[xj0] holds (provided that Tj0 installs an object version xj0 and Ti later reads the created version). Since Tj0 wr-depends on Ti, it follows that the ordering relation ci < rj0[xi] holds as well (provided that Ti installs an object version xi and Tj0 later reads the created version). This, however, leads to a contradiction since, according to Points (3) and (4) of Definition 9, a transaction can only commit after it was initiated and after all its operations have been executed, so that cj0 < ci and ci < cj0 would have to hold at the same time. Thus, MVSG(MVH) cannot contain the cycle ⟨Ti →wr Tj0 →wr Ti⟩ if the MVCC-IBOTO scheme is used.

(b) Suppose that Ti rw-depends on Tj0 and provided that ci precedes cj0 and Tj0 does not rw-depend on Ti by the time the former is informed about Ti's commit, it follows that RFSTS(Tj0) ≤ CTS(Ti) and Ti is a member of ST(Tj0) according to Algorithm 6.7. Now suppose at a later stage Tj0 overwrites some object version observed by Ti, which implies that Tj0 now rw-depends on Ti. Then, however, MVCC-IBOTO's write rule (Point (3(a)i) of Algorithm 6.1) would be violated if this operation were allowed to occur. Suppose otherwise that Tj0 had already rw-depended on Ti by the time Tj0 was validated against Ti. Then, Serializability Condition 1 or 2 of Algorithm 6.7 would be violated if Tj0 were not aborted. Alternatively, suppose that Ti rw-depends on Tj0, cj0 precedes ci in MVH, and Tj0 does not rw-depend on Ti by the time the former gets to know about Ti's commit. Now suppose at a later stage Ti observes some object version whose successor version was installed by Tj0, thus Tj0 rw-depends on Ti. Then, however, MVCC-IBOT's read rule (i.e., Point (1c)) would be violated if this read operation were not rejected. Now suppose that Tj0 had already rw-depended on Ti by the time the latter was validated against Tj0. Then, Serializability Condition 1 or 2 of Algorithm 6.7 would be violated if Ti were not aborted. Thus, the cycle ⟨Ti →rw Tj0 →rw Ti⟩ cannot be produced under the MVCC-IBOTO scheme.

(c) Suppose that Ti wr-depends on Tj0. Then, it follows that RFSTS(Ti) > CTS(Tj0). Now suppose that Tj0 rw-depends on Ti, which implies that bi precedes cj0 and CTS(Tj0) ≥ RFSTS(Ti), leading to a contradiction. Using the same line of reasoning, we can show that the dependencies "Ti rw-depends on Tj0" and "Tj0 wr-depends on Ti" cannot coexist without violating the MVCC-IBOTO scheme. Thus, the cycles ⟨Ti →rw Tj0 →wr Ti⟩ and ⟨Ti →wr Tj0 →rw Ti⟩ cannot occur in MVSG(MVH) and, therefore, it cannot contain any cycle involving exactly two transactions.

(2) Cycle consists of three or more transactions: In the more complex case, the cycle may involve three or more read-write transactions. Irrespective of how many transactions form the cycle, it must have an edge Ti → Tj0.

(a) Let us initially assume there is a wr-edge from Ti to Tj0, i.e., Tj0 wr-depends on Ti, which implies that ci < cj0, RFSTS(Tj0) > CTS(Ti), and Ti belongs to PT(Tj0).

I) Besides suppose that Tjn (either directly or indirectly) wr-depends on Tj0, which implies that cj0 < cjn, RFSTS(Tjn) > CTS(Tj0), and Tj0 ∈ PT(Tjn).

i) Suppose also that Ti wr-depends on Tjn, which implies that cjn < ci, RFSTS(Ti) > CTS(Tjn), and Tjn ∈ PT(Ti), leading to a contradiction since ci cannot precede and succeed cjn at the same time. As a result, the cycle ⟨Ti →wr Tj0 →wr ... →wr Tjn →wr Ti⟩ cannot be formed in MVSG(MVH).

ii) Suppose, alternatively, that Ti rw-depends on Tjn, which implies that bjn precedes ci and CTS(Ti) ≥ RFSTS(Tjn). Since the conditions CTS(Ti) ≥ RFSTS(Tjn) and RFSTS(Tjn) > CTS(Tj0) hold, it follows that CTS(Ti) > CTS(Tj0) holds too, which, in turn, implies the ordering relation cj0 < ci. This, however, contradicts the ordering relation ci < cj0 derived above. Thus, the cycle ⟨Ti →wr Tj0 →wr ... →wr Tjn →rw Ti⟩ cannot occur in MVSG(MVH) either.

II) Consequently, the cycle may only occur when Tjn (either directly or indirectly) rw-depends on Tj0, which implies that bj0 precedes cjn and CTS(Tjn) ≥ RFSTS(Tj0).

i) Suppose further that Ti wr-depends on Tjn, which implies that cjn < ci, RFSTS(Ti) > CTS(Tjn), and Tjn ∈ PT(Ti). Since the conditions CTS(Tjn) ≥ RFSTS(Tj0) and RFSTS(Tj0) > CTS(Ti) hold, it follows that the condition CTS(Tjn) > CTS(Ti) holds too, which, in turn, implies the ordering relation ci < cjn. This, however, contradicts the ordering relation cjn < ci derived above. Consequently, the cycle ⟨Ti →wr Tj0 →rw ... →rw Tjn →wr Ti⟩ cannot occur in MVSG(MVH) as well.

ii) To conclude this series of possible cycles, suppose that Ti rw-depends on Tjn, which implies that bjn precedes ci and CTS(Ti) ≥ RFSTS(Tjn). Since the ordering relations bj0 < cjn and bjn < ci do not determine the relative commit order of Tj0 and Tjn, two cases have to be distinguished:

• Suppose that Tj0 commits before Tjn and that Tjn does not rw-depend on Tj0 by the time Tjn gets to know about Tj0's commit. Suppose further that at some later point in time Tjn overwrites some object version observed by Tj0, thus Tjn rw-depends on Tj0. Then, however, MVCC-IBOTO's write rule (i.e., Point (3(a)ii) of Algorithm 6.1) would be violated if the write operation were not rejected. Suppose, alternatively, that Tjn had already rw-depended on Tj0 when Tjn was validated against Tj0. In this case Serializability Condition 1 or 3 of Algorithm 6.7 would be violated if Tjn were not aborted. Thus, the cycle ⟨Ti →wr Tj0 →rw ... →rw Tjn →rw Ti⟩ cannot be created in MVSG(MVH) when Tj0 commits before Tjn.

• Suppose, on the contrary, that Tj0 commits after Tjn and that Tjn does not rw-depend on Tj0 by the time the latter gets to know about Tjn's commit. Suppose further that at some later point in time Tj0 observes some object version whose successor version was installed by Tjn, thus Tjn rw-depends on Tj0. Then, however, MVCC-IBOTO's read rule (i.e., Point (1c)) would be violated if the read operation were not rejected. Suppose, alternatively, that Tjn had already rw-depended on Tj0 when Tj0 was validated against Tjn. Then, however, Serializability Condition 2 or 4 of Algorithm 6.7 would be violated if Tj0 were not aborted. Thus, the cycle ⟨Ti →wr Tj0 →rw ... →rw Tjn →rw Ti⟩ cannot be created when Tj0 commits after Tjn.

(b) What remains to be shown is that the cycle cannot be formed when Tj0 rw-depends on Ti. If Tj0 rw-depends on Ti, it follows that bi precedes cj0 in MVH and CTS(Tj0) ≥ RFSTS(Ti).

I) Now suppose that Tjn (either directly or indirectly) wr-depends on Tj0, which implies that cj0 < cjn, RFSTS(Tjn) > CTS(Tj0), and Tj0 ∈ PT(Tjn).

i) Suppose further that Ti wr-depends on Tjn, which implies that cjn < ci, RFSTS(Ti) > CTS(Tjn), and Tjn ∈ PT(Ti). Since the conditions RFSTS(Ti) > CTS(Tjn) and CTS(Tj0) ≥ RFSTS(Ti) hold, it follows that CTS(Tj0) > CTS(Tjn) holds too. This, however, contradicts the previously derived ordering relation cj0 < cjn. Thus, the cycle ⟨Ti →rw Tj0 →wr ... →wr Tjn →wr Ti⟩ cannot occur in MVSG(MVH) when Ti wr-depends on Tjn.

ii) Suppose, on the contrary, that Ti rw-depends on Tjn, which implies that bjn precedes ci in MVH and CTS(Ti) ≥ RFSTS(Tjn). Since the ordering relations bjn < ci and bi < cj0 do not determine the relative commit order of Ti and Tjn, two cases have to be distinguished:

• Suppose that Ti commits before Tjn and that Ti does not rw-depend on Tjn by the time Tjn gets to know about Ti's commit. Suppose further that at some later point in time Tjn observes some object version whose successor version was installed by Ti, thus Ti rw-depends on Tjn. Then, however, MVCC-IBOTO's read rule (i.e., Point (1c)) would be violated if the read operation were not rejected. Suppose, alternatively, that Ti had already rw-depended on Tjn when the latter was validated against Ti. Then, however, Serializability Condition 2 or 4 of Algorithm 6.7 would be violated if Tjn were not aborted. That is, the cycle ⟨Ti →rw Tj0 →wr ... →wr Tjn →rw Ti⟩ cannot occur in MVSG(MVH) when Ti commits before Tjn.

• Suppose, on the contrary, that Ti commits after Tjn and that Ti does not rw-depend on Tjn by the time Ti gets to know about Tjn's commit. Suppose further that at some later stage Ti overwrites some object version observed by Tjn, thus Ti now rw-depends on Tjn. Then, however, MVCC-IBOTO's write rule (i.e., Point (3(a)ii) of Algorithm 6.1) would be violated if the write operation were not rejected. Suppose, alternatively, that Ti had already rw-depended on Tjn when Ti was validated against Tjn. Then, however, Serializability Condition 1 or 3 of Algorithm 6.7 would be violated if Ti were not aborted. Thus, the cycle ⟨Ti →rw Tj0 →wr ... →wr Tjn →rw Ti⟩ cannot occur in MVSG(MVH) even if Ti commits after Tjn.

II) Now suppose that Tjn (either directly or indirectly) rw-depends on Tj0, which implies that bj0 precedes cjn and CTS(Tjn) ≥ RFSTS(Tj0).

i) Suppose also that Ti wr-depends on Tjn, which implies that cjn < ci, RFSTS(Ti) > CTS(Tjn), and Tjn ∈ PT(Ti). Since the ordering relations bj0 < cjn and bi < cj0 do not determine the relative commit order of Ti and Tj0, two cases have to be distinguished:

• Suppose that Ti commits before Tj0 and Tj0 does not rw-depend on Ti by the time Tj0 gets to know about Ti's commit. Suppose further that at some later point in time Tj0 overwrites some object version observed by Ti, thus Tj0 rw-depends on Ti. Then, however, MVCC-IBOTO's write rule (i.e., Point (3(a)ii) of Algorithm 6.1) would be violated if the write operation were not rejected. Suppose, alternatively, that Tj0 had already rw-depended on Ti when Tj0 was validated against Ti. In this case Serializability Condition 1 or 3 of Algorithm 6.7 would be violated if Tj0 were not aborted. That is, the cycle ⟨Ti →rw Tj0 →rw ... →rw Tjn →wr Ti⟩ cannot occur in MVSG(MVH) if Ti commits before Tj0.

• Suppose, on the contrary, that Ti commits after Tj0 and that Tj0 does not rw-depend on Ti by the time Ti gets to know about Tj0's commit. Suppose further that at some later point in time Ti observes some object version whose successor version was installed by Tj0, thus Tj0 rw-depends on Ti. Then, however, MVCC-IBOTO's read rule (i.e., Point (1c)) would be violated if the read operation were not rejected. Suppose, alternatively, that Tj0 had already rw-depended on Ti when Ti was validated against Tj0. Then, again Serializability Condition 2 or 4 of Algorithm 6.7 would be violated if Ti were not aborted. Thus, the cycle ⟨Ti →rw Tj0 →rw ... →rw Tjn →wr Ti⟩ cannot be created in MVSG(MVH) even if Ti commits after Tj0.

ii) Last, but not least, suppose that Ti rw-depends on Tjn, which implies that bjn precedes ci and CTS(Ti) ≥ RFSTS(Tjn). Since the ordering relations bi < cj0, bj0 < cjn, and bjn < ci do not determine the relative commit order of the involved transactions, assume that the commit order is ci < cj0 < cjn (the remaining orderings can be handled analogously). Suppose that Tjn does not rw-depend on Tj0 by the time Tjn gets to know about Tj0's commit. Suppose further that at some later point in time Tjn overwrites some object version observed by Tj0, thus Tjn rw-depends on Tj0. Then, however, MVCC-IBOTO's write rule (i.e., Point (3(a)ii) of Algorithm 6.1) would be violated if the write operation were not rejected. Suppose, alternatively, that Tjn had already rw-depended on Tj0 when Tjn was validated against Tj0. In this case Serializability Condition 1 or 3 of Algorithm 6.7 would be violated if Tjn were not aborted. Thus, the cycle ⟨Ti →rw Tj0 →rw ... →rw Tjn →rw Ti⟩ cannot occur in MVSG(MVH). Consequently, MVSG(MVH) is acyclic and, therefore, MVCC-IBOTO produces only correct histories in the sense that they are serializable.



6.3.5 MVCC-EOT Scheme

The previously described protocols do not provide strong data currency guarantees to read-write transactions, i.e., guarantees enforcing that all object versions processed by their read operations are still up-to-date at the transaction's commit point. Strong data currency guarantees along with adherence to the serializability criterion are required in many conventional and mobile database-supported applications such as stock market, air traffic control, factory floor, or on-board airline databases, to name just a few. Apart from the data currency issue, a further motivating factor for yet another CC protocol is the relatively high space and time costs intrinsic to all previously proposed schemes, making them unattractive for very resource-poor mobile clients. The protocol that eliminates those problems, though at the cost of performance degradation (see Section 6.5.4), is called MVCC-EOT. As its name implies, MVCC-EOT provides clients with end-of-transaction data currency guarantees along with serializability correctness. Since the basic idea and implementation underlying MVCC-EOT are akin to the invalidation-only method [123, 124], we only briefly sketch its core components. Like other protocols in the MVCC-* suite, MVCC-EOT exploits periodically disseminated CCRs in order to pre-validate active read-write transactions. Whenever a new CCR appears on the broadcast channel, MVCC-EOT, or, more precisely, the client transaction manager, validates an active read-write transaction Ti against each read-write transaction Tj included in the report by means of the following algorithm:

1  begin
2    foreach Tj in CCR do
3      if Serializability Condition 2 is violated then
4        abort Ti
5  end

Algorithm 6.9: CCR processing and transaction validation under MVCC-EOT.
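The validation loop above can be sketched in a few lines of Python; reading Serializability Condition 2 as a read-set/write-set intersection test is our interpretation for this illustration (the exact condition is defined earlier in the chapter), and the function name is an assumption.

```python
def validate_eot(ti_read_set, ccr_write_sets):
    """Sketch of MVCC-EOT's CCR processing (Algorithm 6.9): an
    active transaction Ti survives a CCR only if no recently
    committed transaction has overwritten an object Ti has read.
    ti_read_set:     set of object ids read by Ti so far.
    ccr_write_sets:  write sets of the transactions in the CCR."""
    for write_set in ccr_write_sets:
        if ti_read_set & write_set:
            return False  # stale read detected: abort Ti
    return True           # Ti may continue
```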

To enforce EOT data currency guarantees, MVCC-EOT aborts and subsequently restarts any active read-write transaction Ti that has observed a stale object version. With regard to the data currency guarantees, MVCC-EOT is obviously strictly more restrictive than MVCC-IBOT, which results in permitting fewer correct histories. The reason is that the MVCC-IBOT protocol is less vulnerable to invalidation notifications reporting that a read-write transaction Tj ∈ Tactive(Ti) has overwritten an object version previously read by the active read-write transaction Ti. While MVCC-EOT always responds to such a message by immediately aborting Ti, for MVCC-IBOT it is merely an indication to terminate Ti's RFP and to change the way the scheduler translates read operations into actual version read steps. Under the MVCC-IBOT scheme, an invalidating transaction Tj causes an active read-write transaction Ti to abort only in those cases when either (a) Ti rw-depends on Tj, or (b) Tj rw-depends on Ti, another read-write transaction Tk is contained in ST(Tj), i.e., Tk wr- or rw-depends on Tj, and Ti, in turn, wr-depends on Tk, i.e., Tk ∈ PT(Ti). Therefore, MVCC-IBOT's performance is expected to be superior to that of the MVCC-EOT scheme. Determining whether MVCC-BOT or MVCC-EOT permits more histories is not as straightforward since the two protocols are incomparable: they ensure different data currency guarantees and employ dissimilar transaction validation algorithms. We will not discuss this issue further here and refer the interested reader to our simulation study, which shows the comparative performance of both protocols.

To complete the protocol's description, we formulate the rules according to which MVCC-EOT's scheduler processes read and write operations issued by an active read-write transaction Ti:

1. Read Rule: A read operation ri[x] is transformed into ri[xk], where xk is the latest committed version of x received by the client up to now.

2. Write Rule: A write operation wi[x] is transformed into wi[xi] and executed.

Algorithm 6.10: MVCC-EOT’s scheduling algorithm.

Similarly to other protocols in the MVCC-* suite, MVCC-EOT initiates Ti's final validation as soon as Ti's last data operation has been processed by the client. For that purpose, the client transaction manager transmits FVM(Ti) containing the following components to the server: (a) ReadSet(Ti), (b) WriteSet(Ti), and (c) TSc(CCRlast). It is interesting to note that final validation messages emitted by the MVCC-EOT scheme contain a proper subset of the information contained in those of the other schemes of the MVCC-* suite. Thus, MVCC-EOT incurs the lowest local storage and network costs for mobile clients among all protocols of the MVCC-* suite.

Last, but not least, we give the proof that MVCC-EOT produces only correct histories in the sense that they are serializable and all committed transactions fulfill EOT data currency guarantees.

Theorem 13. MVCC-EOT produces only serializable multi-version histories that ensure EOT data currency guarantees to read-write transactions, i.e., for each read operation ri[xj] ∈ ReadSet(Ti) of any committed read-write transaction Ti in MVH there exists no object version xk at Ti's commit point such that xj ≪ xk.

Proof. Again, the proof is divided into two parts. We start by proving that MVCC-EOT ensures EOT data currency and then show that the protocol generates only serializable histories.

Part A: Let Ti denote a read-write transaction with EOT(Ti) being the logical time of Ti's final validation, and DSEOT(Ti) representing the database state at Ti's commit time. We claim that the values read by Ti correspond to DSEOT(Ti). Now suppose, by way of contradiction, that Ti has read an object version xj not belonging to DSEOT(Ti), i.e., xj has been overwritten by some version xk and is therefore discarded from DSEOT(Ti). Then, however, Serializability Condition 2 of Algorithm 6.9 would be violated and, therefore, Ti's commit is not possible. Thus, MVCC-EOT provides EOT data currency guarantees.

Part B: Let MVH denote any multi-version history produced under the MVCC-EOT scheme with MVSG(MVH) being its multi-version serialization graph. To show that MVSG(MVH) is acyclic, suppose, by means of contradiction, that MVSG(MVH) contains a cycle of the form ⟨Ti → Tj0 → ... → Tjn → Ti⟩, where Ti and Tj0, ..., Tjn with n ≥ 0 denote read-write transactions that have been processed under the MVCC-EOT scheme. Then, in order for the cycle to be produced, Ti must have both an incoming and an outgoing edge. Thereby, the outgoing edge Ti → Tj0 originates from a read-write transaction Ti that either observed an object version whose successor version was installed by Tj0 or that installed an object version read by Tj0, and the incoming edge Tjn → Ti leads to a read-write transaction Ti that either created the successor of an object version read by Tjn or observed an object version written by Tjn. It is important to note that Tj0 and Tjn do not necessarily need to be distinct transactions.

(a) To start with, initially suppose that Tj0 rw-depends on Ti.

I) Suppose further that Ti commits after Tj0, i.e., cj0 < ci, and that the rw-dependency between Tj0 and Ti had already existed by the time Ti was validated against Tj0. Then, Serializability Condition 2 of Algorithm 6.9 would be violated if Ti were not aborted. Suppose, alternatively, that Tj0 had not rw-depended on Ti by the time the latter was validated against Tj0. Then, however, MVCC-EOT's read rule would be violated if Ti had missed Tj0's effects. Thus, the cycle ⟨Ti →rw Tj0 → ... → Tjn → Ti⟩ (with dependencies of arbitrary type on the remaining edges) cannot occur in MVSG(MVH) provided that the condition cj0 < ci holds.

II) Now suppose that Tj0 rw-depends on Ti and Ti commits before Tj0, i.e., ci < cj0. Suppose also that Tjn (either directly or indirectly) wr- or rw-depends on Tj0, which implies that the ordering relation cj0 < cjn holds; note that Tj0 must commit before Tjn since otherwise MVCC-EOT's validation algorithm (see Algorithm 6.9) or read rule (see Algorithm 6.10) would be violated. Suppose further that Ti wr- or rw-depends on Tjn, which implies that the ordering relation cjn < ci holds. This, however, contradicts the ordering relations ci < cj0 < cjn derived above, as they would require a transaction to commit more than once, which is ruled out by Points (3) and (4) of Definition 9. Thus, Ti cannot get involved in a cycle when Tj0 rw-depends on it, i.e., the cycle ⟨Ti →rw Tj0 → ... → Tjn → Ti⟩ cannot be produced by MVCC-EOT even if Ti commits before Tj0.

(b) It remains to prove that Ti cannot be part of a cycle when Tj0 wr-depends on it. To this end, suppose that Tj0 wr-depends on Ti, which implies that the ordering relation ci < cj0 holds. Suppose further that Tjn (either directly or indirectly) wr- or rw-depends on Tj0, which implies that the ordering relation cj0 < cjn holds. Finally, suppose that Ti wr- or rw-depends on Tjn, which implies that the ordering relation cjn < ci holds. Due to the transitive nature of the precedence relation, ci < ci would have to hold, which is impossible since a transaction can commit only once during its lifetime. Thus, the cycle ⟨Ti →wr Tj0 → ... → Tjn → Ti⟩ cannot be produced without violating the MVCC-EOT scheme and, therefore, we can conclude that MVCC-EOT produces only serializable histories.



6.4 Performance-related Issues

In the subsequent subsections, we elaborate on some issues that considerably influence the performance of the MVCC-* schemes, namely data caching, intermittent network connectivity, and the exploitation of semantic knowledge for concurrency control.

6.4.1 Caching

As all protocols of the MVCC-* suite rely on multi-versioning and thus require that multiple versions of frequently modified objects are available in the system, a global version storage strategy needs to be devised. A simple, but (very) costly, strategy would be to maintain all the versions of each database object somewhere in the system. However, such a storage policy could exceed today's primary and secondary storage capacities if we (not unrealistically) assume that object updates occur frequently. Fortunately, there is no need to maintain a complete history of all the generated versions of each database object to efficiently facilitate multi-version concurrency control. In order to provide optimal data support for any of the protocols in the MVCC-* suite, the system needs to maintain only those object versions that are at least as current as the versions that existed by the time the oldest active read-write transaction in the system started its execution. Even though such an approach prevents the number of versions from growing indefinitely, there is still no guarantee that the server is capable of storing all those versions. As a matter of fact, since read-write transactions are usually long-running in nature, as clients may become disconnected and experience long network propagation delays, the number of versions that need to be kept could still exceed the storage capacities of the server. Taking that into account, we suggest imposing a limit on the number of versions the server is allowed to maintain. Experimental results previously presented in Subsection 5.4.5.1 have shown that MVCC protocols designed for processing read-only transactions achieve good performance if the server keeps up to five of the most recently installed versions of each database object. Since the MVCC protocols used in the simulation study presented in Subsection 5.4.5.1 impose similar data storage requirements on the database system as the MVCC-* schemes, we expect such a version limit to be practicable even for systems where read-write transactions are processed.
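A bounded per-object version store of the kind suggested above might look as follows; the data structure and its interface are illustrative assumptions of this sketch, not part of the dissertation's system design.

```python
from collections import defaultdict

class BoundedVersionStore:
    """Keeps at most `limit` of the most recently installed
    versions of each database object (the text above suggests
    limit = 5 as a practicable choice)."""

    def __init__(self, limit=5):
        self.limit = limit
        self.versions = defaultdict(list)  # object id -> [(cts, value), ...]

    def install(self, oid, cts, value):
        chain = self.versions[oid]
        chain.append((cts, value))
        chain.sort()                  # oldest version first
        del chain[:-self.limit]       # discard versions beyond the limit

    def lookup(self, oid, max_cts):
        """Most recent version of `oid` with CTS <= max_cts, or None
        if that version has already been garbage-collected."""
        matches = [v for v in self.versions[oid] if v[0] <= max_cts]
        return max(matches) if matches else None
```

A failed `lookup` models the situation described above in which a client's fetch request for an old version cannot be satisfied anymore and the requesting transaction has to be aborted.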

Apart from the fact that as many requested object versions as possible should be available in the system, another important performance issue is that the desired object version should be located as close as possible to the client application. Ideally, all database versions that are of potential use for the client would be stored in the client cache. However, the size of the client cache is typically much smaller than the size of the database, allowing the client to cache only a subset of the useful versions of the database. Therefore, the basic problem is to determine which versions of which objects should be cached at the client to achieve the best overall system performance and to decide which object version should be evicted once the client cache is full and a new version is requested. The latter issue is referred to as the cache replacement problem and can be resolved by deploying a judicious cache replacement policy that selects, among the cached object versions, the one with the lowest expected utility for the client in the future. As a matter of fact, finding optimal cache replacement victims in a multi-version dissemination-based environment is more complex than in conventional distributed environments since the caching utility of an object version xj for a client running a transaction Ti depends on a number of factors, namely (a) the access recency and frequency of x in the recent past, (b) the update probability of x in the near future, (c) the version storage policy of the server, (d) the re-acquisition costs of xj once evicted from the client cache, and (e) the re-processing costs of any active transaction Ti that would occur when a fetch request for xj fails because the version has been evicted from the system and, therefore, the transaction needs to be aborted.
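To illustrate how the factors (a)-(e) listed above could enter a cost-based eviction decision, consider the following Python sketch. The weighting is invented purely for illustration; it is not MICP's actual cost metric, and all field names are assumptions.

```python
def eviction_victim(cache, now):
    """Pick the cached version with the lowest expected utility.
    cache: list of dicts with illustrative fields
      last_access, access_freq  -> factor (a)
      update_prob               -> factor (b)
      refetch_cost              -> factor (d)
      re_cacheable, abort_cost  -> factors (c) and (e)
    The weighting below is an invented example, not MICP's metric."""
    def utility(v):
        recency = 1.0 / (1.0 + now - v['last_access'])
        # versions that can be re-acquired cheaply are cheap to evict;
        # non-re-cacheable versions additionally risk transaction aborts
        penalty = v['refetch_cost'] + (0 if v['re_cacheable'] else v['abort_cost'])
        return recency * v['access_freq'] * (1 - v['update_prob']) * penalty
    return min(cache, key=utility)
```

The point of the sketch is merely that the eviction decision combines access statistics with re-acquisition and re-processing costs, rather than relying on access probability alone.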

In Chapter 5 we introduced an efficient integrated cache replacement and prefetching policy, called MICP, which takes all those parameters into account when determining eviction victims.

However, MICP was designed to efficiently support read-only transactions and, therefore, might either not be suitable for read-write transactions or might need to be adapted to meet their specific requirements. If one considers the above parameters applied by MICP to determine replacement victims, it is obvious that they are not transaction-type-specific and are therefore applicable to read-write transactions as well. Additionally, MICP's underlying partitioned cache structure (see Figure 5.2) suits read-write transactions as well as read-only ones, since both types require current and non-current data objects for transaction processing and suffer heavily from object version misses. Remember that MICP tries to minimize the performance penalty produced by such cache misses by (a) allocating dedicated memory space to non-re-cacheable object versions to prevent them from competing with re-cacheable object versions for memory slots and (b) using a cost-based rather than a probability-based metric to choose replacement victims. There is, however, one facet of MICP whose operation is not independent of the transaction type concerned, namely part of the conditions for evicting useless object versions. Garbage collection of object versions is actually not only transaction-type-dependent, but is even a matter of the CC scheme being used. As different protocols enforce dissimilar correctness and currency guarantees, the conditions for discarding object versions from the cache deviate from one protocol to another. MICP in this respect relies on close cooperation with the client transaction manager, whose task is to provide the information necessary to enforce the correct and instant eviction of locally cached object versions once they become useless. Due to this decoupling, MICP is suitable for virtually any transaction type and CC scheme and can therefore be used without any adaptation with the MVCC-* suite as well. Whether or not MICP is able to outperform other well-known caching and prefetching policies when being used by clients processing read-write transactions rather than read-only ones is examined in Subsection 6.5.5.4.
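To make the cost-based flavor of such a policy concrete, the following sketch shows a simplified replacement decision in the spirit of MICP. It is an illustration only, not MICP itself: all class names, fields, and the penalty formula are our own simplifications of the factors (a)–(e) listed above.

```python
from dataclasses import dataclass

@dataclass
class CachedVersion:
    oid: str                 # object identifier
    version: int             # version number
    access_prob: float       # estimated re-access probability (recency/frequency)
    refetch_cost: float      # cost to re-acquire the version once evicted
    re_cacheable: bool       # False if the server may no longer hold this version
    abort_cost: float = 0.0  # re-processing cost if a miss aborts the transaction

def eviction_victim(cache: list) -> CachedVersion:
    """Pick the version whose loss is expected to hurt least.

    For re-cacheable versions, the penalty of eviction is a possible re-fetch;
    for non-re-cacheable ones, a later miss may abort the whole transaction,
    so their penalty additionally includes the re-processing cost.
    """
    def expected_penalty(v: CachedVersion) -> float:
        penalty = v.refetch_cost
        if not v.re_cacheable:
            penalty += v.abort_cost
        return v.access_prob * penalty
    return min(cache, key=expected_penalty)
```

Note how a rarely accessed but non-re-cacheable version can outrank a hot re-cacheable one: its high abort penalty keeps it in the cache, which is exactly the effect a pure recency-based policy such as LRU cannot achieve.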

6.4.2 Disconnections

For reasons of presentational convenience, the MVCC-* suite was introduced under the fundamental assumption that clients and the server are persistently connected. However, this does not match the reality of mobile computing. Mobile clients are typically only weakly connected to the server, i.e., there are periods when clients may be disconnected as well as periods when data transfer between both parties can occur. Network disconnections can be either involuntary, e.g., due to a network, client, or server failure, or voluntary, e.g., when the user temporarily unplugs the device from the network. Irrespective of the cause of the disconnection and the CC protocol deployed, disconnection from the network has a detrimental effect on the client's operations. (a) First, as clients no longer perceive CCRs, they cannot validate their ongoing transactions against recently committed ones and, therefore, may perform wasted work by continuing to process transactions destined to abort. (b) Second, since clients are unable to update their caches during disconnection periods, the currency of the client cache degrades and may subsequently cause stale cache reads. (c) Third, and most important from the performance perspective, transaction processing may be hindered or even interrupted, e.g., if a requested object version is not cache-resident or a final commit message cannot instantly be transferred to the server. It is obvious that all of the aforementioned limitations restrict client transaction processing irrespective of the CC protocol being deployed. However, the impact of network disruptions on the protocols of the MVCC-* suite varies from one scheme to another.

In what follows, we discuss whether the protocols in the MVCC-* suite operate correctly under disconnections and, if not, we propose measures to rectify the problem. As a matter of fact, clients do not receive CCRs when disconnected from the broadcast channel. Fortunately, with the exception of the MVCC-IBOT and MVCC-IBOTO protocols, the MVCC-* suite does not require CCRs for protocol correctness, but rather to guarantee short response times to applications using it.

The MVCC-IBOT and MVCC-IBOTO protocols, however, require those reports in order to determine for each active read-write transaction Ti whether its RFP is completed and, if so, what its RFSTS is. Remember that the RFSTS is necessary for the schedulers of both protocols in order to correctly map object read operations to the respective version reads. Also, the RFSTS of an active read-write transaction Ti influences its conflict relationships with the other read-write transactions in the history MVH. Therefore, RFSTS(Ti) decides on Ti's serialization order within MVH and hence, if chosen wrongly, may turn MVH into an unserializable history, as the following example shows:

Example 10.

MVH9 = bc1 b1 r1[z0] r1[x0] b2 r2[x0] bc2 w1[z1] w2[x2] r2[y0] bc3 w2[y2] c2 bc4 r1[y2] c1

[x0 ≪ x2, y0 ≪ y2, z0 ≪ z1]

History MVH9 depicts the operations of two concurrent transactions T1 and T2 that run at clients C1 and C2, respectively. As MVH9 shows, transaction T2 committed during MIBC 3 and hence, its CC-related information was broadcast at the beginning of MIBC 4. Now suppose that C1 lost its connection to the broadcast channel between the middle of MIBC 3 and the end of MIBC 4 and, therefore, missed the CCR containing T2's CC information. By the time C1 reconnects to the network, RFF(T1) is still set to true, allowing T1 to read "forward" on recently committed transactions. This, however, may actually result in an "incorrect" or unserializable history, as illustrated in Figure 6.5.

[Figure: edges T0 → T1 (wr, ww), T0 → T2 (wr, ww), T1 → T2 (rw), and T2 → T1 (wr), forming a cycle between T1 and T2.]

Figure 6.5: Multi-version serialization graph of MVH9.
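The cycle in Figure 6.5 can be verified mechanically. The sketch below builds the multi-version serialization graph (MVSG) of MVH9 from the edges read off the history (the rw edge T1 → T2 because T1 read x0 while T2 wrote x2, and the wr edge T2 → T1 because T1 read y2) and checks it for a cycle with a plain depth-first search. This is a generic illustration, not part of any MVCC-* protocol.

```python
def has_cycle(edges):
    """Return True iff the directed graph given as (src, dst) pairs is cyclic."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in graph}
    def dfs(n):
        color[n] = GRAY               # node is on the current DFS path
        for m in graph[n]:
            if color[m] == GRAY or (color[m] == WHITE and dfs(m)):
                return True           # back edge found: the graph is cyclic
        color[n] = BLACK
        return False
    return any(color[n] == WHITE and dfs(n) for n in graph)

# Edges of the MVSG of MVH9, mirroring Figure 6.5.
mvsg_mvh9 = [("T0", "T1"), ("T0", "T2"), ("T1", "T2"), ("T2", "T1")]
```

Running `has_cycle(mvsg_mvh9)` confirms the cycle between T1 and T2, i.e., that MVH9 is not one-copy serializable.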

To protect MVCC-IBOT and MVCC-IBOTO from producing unserializable histories due to

CCR misses, there are basically two alternatives: (a) We could pessimistically assume that any active read-write transaction Ti becomes invalidated during a client’s disconnection period. To comply with this assumption, at disconnection time we could automatically set RFSTS(Ti) to the 6.4. Performance-related Issues 213 largest commit timestamp of any read-write transaction being reported to the client as having suc- cessfully completed its execution, which has the effect as if Ti had been invalidated just before the disconnection actually occurred. (b) We could optimistically assume that disconnection periods are of short duration and invalidations occur only infrequently. Consequently, Ti’s RFP does not need to be automatically ended by the time the disconnection actually takes place. Rather, upon recon- nection the client could determine whether Ti had been invalidated during the disconnection period and, if so, RFSTS(Ti) could be set accordingly. If we compare both alternatives, at first glance the latter alternative may appear to be more attractive since it prevents an active read-write transaction Ti from unnecessarily terminating its

RFP and guarantees that RFSTS(Ti) reflects Ti’s actual invalidation time. However, the optimistic approach has a serious disadvantage w.r.t. reconnections since it forces transaction processing to be blocked until every missed CCR has been processed by the client. As communication bandwidth in mobile environments is severely limited and round trip times (RTTs) are high compared to sta- tionary computing, transaction processing is expected to be interrupted for a significant amount of time. The pessimistic approach avoids such reconnection-induced waiting times at the cost of a potentially increasing number of transaction aborts caused by additionally occurring data conflicts between concurrent read-write transactions. Furthermore, this approach is much easier to imple- ment and has therefore been selected for our experimental study. Finally, we can conclude that disconnections may be useful in order to preserve scarce battery power of mobile devices. How- ever, as far as CC is concerned, they are rather unbeneficial causing the overall system performance to degrade. However, it is important to note that despite those drawbacks intermittent connectivity does not violate the correctness of our proposed protocols.
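The pessimistic rule chosen above is simple enough to state in a few lines. The following sketch is illustrative only; the class and attribute names (`rff`, `rfsts`) are our shorthand for the read-forward flag and read-forward serialization timestamp, not the simulator's actual code.

```python
class ClientTransaction:
    def __init__(self, tid):
        self.tid = tid
        self.rff = True      # read-forward flag: RFP still open
        self.rfsts = None    # read-forward serialization timestamp

def on_disconnect(active_txns, last_reported_commit_ts):
    """Pessimistically end the read-forward period of every active
    read-write transaction, as if each had been invalidated just before
    the disconnection occurred."""
    for txn in active_txns:
        if txn.rff:
            txn.rff = False
            txn.rfsts = last_reported_commit_ts
```

Transactions whose RFP had already ended keep their existing RFSTS; only those still reading forward are clamped to the largest commit timestamp reported so far.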

6.4.3 Conflict Reducing Techniques

In this subsection we briefly sketch some methods aimed at reducing the number of data conflicts that occur when applying the MVCC-* suite. The basic idea behind those methods is to exploit semantic knowledge that can be derived from the objects being accessed and modified, from the application that operates on them, and from the database structure and integrity constraints, or that can be provided by the application programmer and application user, in order to identify so-called false or permissible conflicts among those detected by the protocols' transaction validation algorithm. In the following we examine methods for increasing the amount of permissible concurrency generated by the MVCC-* suite.

Remember that the MVCC-* suite has been designed to provide consistency management for general-purpose database applications at relatively low implementation cost. To meet those prerequisites, all protocols of the MVCC-* suite operate on a relatively low level of abstraction, namely simple object reads and writes. As those operations do not capture any semantics of their higher-level operations or other valuable knowledge, conflict avoidance measures are limited to avoiding conflicts caused by the grouping of information and by restricting the number of versions of the same data object stored in the system. False conflicts due to information grouping are reduced since our protocols carry out concurrency control on an object rather than on a page basis. The underlying system architecture of the MVCC-* suite does not impose any fixed upper limit on the number of object versions the system is allowed to preserve simultaneously. Even though there is a physical limit on the number of versions that the server can maintain for each database object, there exists no system-wide upper limit on this number because clients may compensate for the version restriction of the server by storing non-re-cacheable objects in their local caches.

All of those conflict-reducing techniques can be extended by the following non-exclusive list of methods: The first approach to diminishing false conflicts is to specify dependency relations among read and write operations of the same read-write transaction. One of the inherent problems of the read/write model is that the transaction manager does not have any information as to whether some write operation wi[xi] depends on some earlier read operation, such as ri[xj] or ri[yk], invoked by the same read-write transaction Ti. As a consequence, the read/write model conservatively assumes that any object version written by a transaction Ti depends on the values of all object versions previously read. To eliminate transaction aborts due to that conservatism, the application programmer or a skilled user could specify, on the basis of program-specific knowledge, for each write operation wi[xi] of Ti its dependency relation DR(xi) w.r.t. previous read operations. Under this concept, a read operation ri[xj] that appears in the dependency relation of none of Ti's write operations can conceptually be split off from Ti, as the following example shows:

Example 11.

MVH10 = bc1 b1 r1[y0] r1[x0] b2 r2[y0] bc2 w1[x1] r2[x0] bc3 w2[y2] c2 bc4 c1

[DR(x1) = {x0}, DR(y2) = {x0, y0}, x0 ≪ x1, y0 ≪ y2]

History MVH10 indicates which read and write operations of transactions T1 and T2 are executed in what order. Both transactions modify the database state by updating objects x and y, respectively. Additionally, two dependency relations are associated with MVH10 (added below the history in square brackets), indicating that object version x1 depends on the value of x0, whereas object version y2 depends on the values of x0 and y0. Because of that information, we know that T1's read operation r1[y0] does not contribute to T1's write operation and, therefore, r1[y0] can be conceptually treated as a separate transaction T1' that can be disentangled from the rest of transaction T1. As we now have three separate transactions T1', T1, and T2, MVH10 can be serialized in the order T1' → T2 → T1, which was not possible before splitting T1 into two transactions.
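The splitting decision in Example 11 reduces to a simple set computation: a read can be peeled off into a separate read-only transaction exactly when it contributes to no write. The sketch below is our illustration of that check; the representation of operations and dependency relations is invented for the example, not taken from the dissertation's implementation.

```python
def splittable_reads(reads, writes, dr):
    """Return the reads of a transaction that contribute to none of its writes.

    reads  -- set of version identifiers read, e.g. {"x0", "y0"}
    writes -- set of version identifiers written, e.g. {"x1"}
    dr     -- dict mapping each written version to the set of reads it depends on
    """
    needed = set()
    for w in writes:
        needed |= dr.get(w, set())
    return reads - needed
```

For T1 in MVH10, `splittable_reads({"x0", "y0"}, {"x1"}, {"x1": {"x0"}})` yields `{"y0"}`, i.e., exactly the read r1[y0] that Example 11 disentangles into T1'.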

Another technique for reducing the number of transaction aborts is to provide alternatives for all those write operations that are likely to cause conflicts with other concurrent transactions [126, 127, 150]. Such alternative writes can be used as fallback operations to resolve ww- or rw-conflicts detected by the client or server transaction manager. As an example illustrating the basic idea behind this concept, consider a mobile train ticket selling and reservation system:

Example 12.

In such a system, train passengers can reserve and buy tickets directly on the train. Suppose a customer is trying to book a second-class window seat on the train from Zurich to Milan leaving Zurich at 8:10 pm on 11/10/2004. Further suppose that by the time the train guard enters the ticket request into the system, the mobile device running the ticket application is disconnected from the central database server. Additionally, assume the guard is informed by the local system that the desired seat category was nearly sold out by the time the client got disconnected from the server. Hence, the customer would be asked to provide alternative travel arrangements in order to prepare for the case that the intended booking fails. By providing alternative write operations to those specified in the initial booking transaction, the customer increases the likelihood of the transaction being validated positively at the server, since the transaction manager now has the opportunity to prevent the transaction from being aborted due to ww- or rw-conflicts with other concurrent, but previously committed, transactions.

Last, but not least, it is important to note that this approach is especially attractive for operation-heavy and conflict-prone transactions. However, as shown in Subsection 6.5.5.2, even short transactions can benefit significantly from this approach.

A further approach with the potential to achieve a higher degree of transaction concurrency is to analyze higher-level application and database operations with the purpose of identifying pairs of operations that either unconditionally commute or commute only in specific database states. The concept of commutativity is well-known to the database community [11, 25, 66, 140, 157, 158] and is therefore only briefly discussed in the context of the MVCC-* suite. Since state-based commutativity is a generalization of unconditional commutativity, the former is more representative and hence underlies the following discussion. In order to apply and decide on state-based commutativity, transactions and their statements are associated with pre- and post-conditions, i.e., a transaction Ti is represented as a triple of the form {CPre,i} Ti {CPost,i}, and the j-th atomic statement Si,j of Ti is specified analogously. Thereby, pre-conditions are a set of assertions about the expected state of database objects as well as program parameters that must hold in order to guarantee the correct execution of the transaction/operation, and post-conditions are a set of conditions about the state of the database and the program after the transaction/operation has finished its execution. Let us now turn to an example illustrating the concept of state-based commutativity and its potential in the framework of the MVCC-* suite:

Example 13.

Consider a mobile airport flight control and ticket sales database providing information to the ground personnel and to all the computerized staff at the airport. Suppose there exists a portable travel agency whose services can be utilized, through the wireless airport network, by any person at the airport who possesses the necessary equipment to communicate with the agency's portable computer/PDA. The travel agency offers its customers the means to reserve and book so-called "last minute" flights. Suppose the travel database contains, among other things, the following relation (the underlined attribute denotes the primary key of the relation): Flight-Leg(LegId, AirplainId, Avail-Seats, Reserved-Seats). To achieve short response times, Flight-Leg is directly accessible and modifiable by all authorized customers of the airport travel agency. Now suppose that the following multi-level history has been produced by two agency customers C1 and C2 that have issued transactions T1 and T2, respectively:

Level 2:  T1                                             T2

Level 1 (T1):
  SELECT AirplainId, Avail-Seats FROM Flight-Leg WHERE LegId = 'B1234D';
  UPDATE Flight-Leg SET Avail-Seats = Avail-Seats - :x WHERE LegId = 'B1234D';

Level 1 (T2):
  SELECT AirplainId, Reserved-Seats, Avail-Seats FROM Flight-Leg WHERE LegId = 'B1234D';
  UPDATE Flight-Leg SET Reserved-Seats = Reserved-Seats + :y WHERE LegId = 'B1234D';

Level 0:  r1[x0] r2[x0] r1[x0] w1[x1] r2[x0] w2[x2]

Figure 6.6: Two-level history showing lower- and higher-order operations of a ticket-buying and a ticket-reservation transaction.

It is intuitively clear that the database is in a consistent state if the following conditions are satisfied:

I1: Avail-Seats ≥ Reserved-Seats and I2: Avail-Seats ≥ 0. Therefore, T1 and T2 can be characterized by the triples {Avail-Seats − x ≥ 0 ∧ Temp = Avail-Seats} T1 {Avail-Seats = Temp − x} and {Reserved-Seats + y ≤ Avail-Seats ∧ Temp = Reserved-Seats} T2 {Reserved-Seats = Temp + y}, respectively. For simplicity, we do not define individual pre- and post-conditions for the SQL statements of T1 and T2 and assume that the transactional conditions also apply to them. Now let us go back to the multi-level history in Figure 6.6. If one considers the read and write operations at level 0 of the history, it becomes obvious that T1 and T2 are not serializable since the corresponding MVSG is cyclic. Despite this fact, the history is semantically correct in the sense that its post-condition is the same as the post-condition of a serial history of transactions T1 and T2 if we assume that the following pre-condition was valid before both transactions started their execution: I3: ((Avail-Seats − x) − (Reserved-Seats + y)) ≥ 0. If I3 holds at their starting points, it does not matter whether T1's operations are executed before, during, or even after T2's operations, since I3 covers the pre-conditions of T1 and T2. This means that the history's result is independent of the execution order of T1 and T2, and integrity constraints I1 and I2 will not be invalidated. Consequently, T1 and T2 can be positively validated, i.e., accepted, at the clients and the server.
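The non-interference condition I3 from Example 13 can be checked with a one-line predicate. The sketch below is our illustration of that check; the function and parameter names are ours, not part of any MVCC-* implementation.

```python
def commute_ok(avail_seats, reserved_seats, x, y):
    """Pre-condition I3: if (Avail-Seats - x) - (Reserved-Seats + y) >= 0
    holds when both transactions start, T1 (sell x seats) and T2 (reserve y
    seats) commute: either execution order preserves I1 (Avail-Seats >=
    Reserved-Seats) and I2 (Avail-Seats >= 0)."""
    return (avail_seats - x) - (reserved_seats + y) >= 0
```

For instance, with 100 available and 40 reserved seats, selling 10 and reserving 20 commute (I3 gives 30 ≥ 0), whereas with 50 available and 40 reserved, selling 10 and reserving 5 do not (I3 gives −5), so the validator must fall back to conventional conflict handling.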

To sum up, we conclude that the three proposed techniques are actually compatible with each other in the sense that they are complementary rather than competitive. These techniques have the potential to achieve a higher degree of concurrency by either splitting original transactions into logically independent subunits, or by providing alternative write operations to those specified in the original transaction, or by specifying non-interference conditions for transactional operations.

6.5 Performance Evaluation

This section describes the experimental setup for evaluating the MVCC-* suite presented above. The experiments were performed using a discrete event-driven simulator implemented with the CSIM simulation package [136]. We opted for a simulation approach rather than a mathematical analysis since performance metrics such as throughput, abort rate, etc., depend on various parameters such as workloads, connectivity between clients and servers, cache replacement strategy, etc., which are not particularly amenable to analysis. The section is organized as follows: a description of the experimental setup, including the system used and the workload model, is given in Subsections 6.5.1 and 6.5.2. In Subsection 6.5.3, we present the motivation behind choosing SI as a comparison scheme for the various protocols of the MVCC-* suite in the performance study. Subsections 6.5.4 and 6.5.5 present the experimental results obtained by our simulation work. The results show that MVCC-IBOTO is superior to the other protocols of the MVCC-* suite. They also show that the cost of providing strong consistency (serializability) to read-write transactions is relatively high, i.e., it is much more expensive to provide serializability to mobile clients than weaker consistency guarantees such as SI, contradicting the results experimentally measured in the stationary client-server environment [9]. The results further show that MICP even outperforms LRFU-P when used as the client cache replacement and prefetching policy to efficiently support not only read-only but also read-write transactions.

6.5.1 Simulator Model

We constructed the simulator and workloads by extending the settings used in the simulation studies presented in Chapters 4 and 5. These studies were conducted to evaluate the performance tradeoffs of providing various degrees of consistency and data currency to read-only transactions in mobile broadcast-based environments, and to quantify the performance improvements achievable when using MICP as the cache replacement and prefetching policy at mobile clients in lieu of other well-known policies. In order to allow the performance results to be compared with those of our previous studies, we preserved key simulation parameters and extended them only where necessary. The parameters of the study are listed in Tables 6.2 and 6.3. The simulation components are only briefly discussed in the following since they are well-known from previous chapters.

Broadcast Server and Mobile Clients:

The core of the simulator consists of 10 mobile clients and a single broadcast/database server.

Client processors run at 100 MIPS, while the server's CPU has a power of 1,200 MIPS. These values reflect typical processor speeds of mobile PDAs and high-performance workstations observed in production systems about three years ago. CPU costs are associated with the events listed in Table 6.2. With respect to storage capacity, clients are diskless and have a relatively small memory cache capable of storing at most 2% of the objects maintained in the database. The client cache is modeled as a hybrid consisting of a small page cache (20% of the CCSize) and a large object cache (80% of the CCSize). The page cache and the object cache are managed by the LRU and MICP-L policies, respectively. The server, on the other hand, is equipped with a relatively large memory cache that may store up to 20% of the database. Similar to the client cache, the server cache is partitioned into a page cache and an object cache. The page cache is managed using an LRU policy, and the object cache is implemented as a mono-version cache maintained in FIFO order, i.e., the object cache is treated similarly to the MOB proposed in [53]. Besides, the server has secondary storage, which is modeled as a disk array consisting of 4 disks. Data pages are statically assigned to one of the available disks, and each disk is modeled as a FIFO queue scheduling operations in the order of their arrival. Disk parameters are listed in Table 6.2 and reflect typical transfer and access times of existing devices.

Server Database Parameters
  Database size (DBSize)                          10,000 objects
  Object size (OBSize)                            100 bytes
  Page size (PGSize)                              4,096 bytes

Server Cache Parameters
  Server buffer size (SBSize)                     20% of DBSize
  Page buffer memory size                         20% of SBSize
  Object buffer memory size                       80% of SBSize
  Cache replacement policy                        LRU

Server Disk Parameters
  Fixed disk setup costs                          5,000 instr
  Rotational speed                                10,000 RPM
  Media transfer rate                             40.00 Mbps
  Average seek time (read)                        4.5 ms
  Average rotational latency                      3.0 ms
  Variable network costs                          7 instr/byte
  Page fetch time                                 7.6 ms
  Disk array size                                 4

Client/Server CPU Parameters
  Client CPU speed                                100 MIPS
  Server CPU speed                                1,200 MIPS
  Client/Server page/object cache lookup costs    300 instr
  Client/Server page/object read costs            5,000 instr
  Register/Unregister a page/object copy          300 instr
  Register an object in prohibition list          300 instr
  Prohibition list lookup costs                   300 instr
  Inter-transaction think time                    50,000 instr
  Intra-transaction think time                    5,000 instr

Table 6.2: Summary of the system parameter settings – I.

Database and Broadcast Program:

The simulated database consists of a set of 10,000 objects sized 100 bytes each. As each disk page is 4 KB in size, the database is stored on 250 disk pages. We selected a relatively small database size in order to make the simulations computationally feasible on today's computer hardware. The broadcast program determines which objects of the database are disseminated to the client population, and how frequently. For reasons of simplicity, the program is modeled by a single broadcast disk, i.e., all data objects are disseminated with the same frequency. To account for the fact that data access in databases is typically skewed [68] and, therefore, access probabilities differ widely among objects, only the most popular 20% of the database is broadcast. The units of data transfer are disk pages, and in order to keep the size of an MBC short, the server broadcasts only the most recent version of any scheduled data object. The broadcast program is static in nature and is organized into 5 equally structured segments. Each segment consists of a data segment, a (1,m) index [78], and a CCR as described in Subsection 6.2.1.
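The segment structure just described can be sketched as follows. This is a deliberately toy model of the broadcast program layout under the stated parameters (one flat broadcast disk, the hottest 20% of the database, 5 equal segments); the dictionary layout and function name are our own simplification, not the simulator's data structures.

```python
def build_mbc(hot_objects, num_segments=5):
    """Split the hot set into num_segments equally structured segments,
    each carrying an index over its data portion, the data itself, and a
    slot for the concurrency control report (CCR)."""
    per_seg = len(hot_objects) // num_segments
    mbc = []
    for s in range(num_segments):
        data = hot_objects[s * per_seg:(s + 1) * per_seg]
        mbc.append({"index": sorted(data), "data": data, "ccr": []})
    return mbc
```

With the default parameters of the study (10,000 objects, 20% broadcast), each of the 5 segments would carry 400 objects plus its index and CCR.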

Network Model:

The network infrastructure of a complete hybrid data delivery environment consists of three communication paths: (a) a broadband broadcast channel, (b) multiple uplink channels, and (c) multiple downlink channels. The network parameters of the communication paths are modeled after a real system such as Hughes Network System's DirecPC [70], which is now called DIRECWAY [71]. The broadcast bandwidth is set to 12 Mbps and the unicast bandwidth is set to 400 Kbps downstream and 19.2 Kbps upstream. The unicast network is modeled as a FIFO queue, and in order to model communication bandwidth restrictions, the number of uplink and downlink channels is limited to two communication links in each direction. Charged network costs consist of fixed and variable cost components and are levied for point-to-point messages only. To model the fact that mobile communication networks are inherently unreliable and communication links can thus be interrupted, we use an additional client parameter termed the disconnection probability. A disconnection probability of zero means that clients do not suffer from intermittent connectivity, whereas a probability of one indicates that a network connection between the server and the clients cannot be established at all. In order to determine how the protocols perform under perfect network conditions, we experimented with a disconnection probability of zero. However, we later relaxed this assumption in the sensitivity analysis of the simulator.

Client Cache Parameters                           Value (Sensitivity Range)
  Client cache size (CCSize)                      2% of DBSize
  Client page cache size                          20% of CCSize
  Client object cache size (OCSize)               80% of CCSize
  Page cache replacement policy                   LRU
  Object cache replacement policy                 MICP (LRFU-P, P-P)
  REC size                                        ≥ 50% of OCSize
  NON-REC size                                    ≤ 50% of OCSize
  Aging factor α                                  0.7
  Replacement policy control parameter λ          0.01
  PCB calculation frequency                       5 times per MIBC

Broadcast Program Parameters
  Number of broadcast disks                       1
  Number of objects disseminated per MBC          20% (20 – 100%) of DBSize
  Number of index segments per MBC                5
  Number of CCRs per MBC                          5
  Bucket size                                     4,096 bytes
  Bucket header size                              96 bytes
  Index header size                               96 bytes
  Index record size                               12 bytes
  Object ID size                                  8 bytes

Network Parameters
  Broadcast bandwidth                             12 Mbps
  Downlink bandwidth                              400 Kbps
  Uplink bandwidth                                19.2 Kbps
  Fixed network costs                             6,000 instr
  Variable network costs                          7 instr/byte
  Propagation and queuing delay                   300 ms
  Number of point-to-point uplink/downlink channels  2
  Client disconnection probability                0% (10 – 50%)
  Client disconnection period                     5 MIBCs

Table 6.3: Summary of the system parameter settings – II.

6.5.2 Workload Model

The simulated workload is synthetically produced by two different workload generators. The workload generators differ from each other in the way and in the place in which they produce read-write transactions. In our simulator, one generator continuously produces read-write transactions at the server. Its purpose is to generate data contention within the system and hence, it indirectly controls the data conflict rate of concurrent transactions. In the standard setting of the simulator, the server workload generator issues two read-write transactions per MIBC, and each transaction has a fixed length of 25 data operations. Objects read and written by those read-write transactions follow a Zipf distribution [168] with parameter θ = 0.80, and the write-read ratio, i.e., the number of writes versus reads, amounts to 1/4, which approximately reflects the average data access and update behavior of transactions in production systems [68]. The second workload generator operates at the mobile clients and differs from that at the server in two ways: (a) It produces read-write transactions of variable length from 5 to 25 data operations, depending on the respective simulator setting, with a default transaction size of 10 data operations. (b) Data access and update operations of client transactions are slightly more skewed than those of server transactions and follow a Zipf distribution with parameter θ = 0.95. To produce resource contention in the network and at the server, the basic setup of the simulator imitates the activities of 10 clients. This number is varied up to 50 clients in the sensitivity analysis. The data access behavior of read-write transactions that have to be aborted due to an irresolvable conflict is controlled by the abort variance parameter, which is set to 100 percent, meaning that restarted transactions do not necessarily read or write the same set of objects as their original transactions. Such a parameter value stresses the system since re-issued transactions do not profit from the caching operations of their initial transactions. Table 6.4 summarizes the workload parameters used in the experimental study.
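A Zipf-distributed access generator of the kind used by both workload generators can be sketched with a textbook inverse-CDF construction. This is an illustration of the distribution only (θ = 0.80 for server and 0.95 for client transactions in the study), not the CSIM workload code; the seed and function names are ours.

```python
import random

def make_zipf_sampler(n, theta, seed=42):
    """Return a sampler over object ids 0..n-1 whose access probabilities
    follow a Zipf distribution with skew parameter theta."""
    rng = random.Random(seed)
    weights = [1.0 / (i + 1) ** theta for i in range(n)]
    total = sum(weights)
    cdf = []
    acc = 0.0
    for w in weights:
        acc += w / total
        cdf.append(acc)
    def sample():
        u = rng.random()
        lo, hi = 0, n - 1
        while lo < hi:                 # binary search for u in the CDF
            mid = (lo + hi) // 2
            if cdf[mid] < u:
                lo = mid + 1
            else:
                hi = mid
        return lo
    return sample
```

With θ close to 1, the bulk of the accesses falls on the lowest-numbered (hottest) objects, which is precisely the skew that makes a small client cache and a hot-set-only broadcast program effective.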

6.5.3 Comparison with other CC Protocols and Integrating Conflict Prevention Measures into the MVCC-* Suite

In order to be able to evaluate the performance of our protocols in comparison to previously proposed ones, we decided to implement, in addition to the algorithms of the MVCC-* suite, the

Workload Parameters                                          Value (Sensitivity Range)
  Number of database servers                                 1
  Number of clients                                          10 (10 – 50)
  Number of server-initiated read-write transactions per MIBC  2
  Read-write transaction size (server)                       25 operations
  Server data update pattern (Zipf distribution, θ)          0.80
  Read-write transaction size (clients)                      10 (5 – 25) operations
  Client data access pattern (Zipf distribution, θ)          0.95
  Number of concurrent read-only transactions per client     1
  Write-read ratio of client/server transactions             1/4
  Abort variance                                             100%
  Uplink usage threshold                                     100%

Table 6.4: Summary of the workload parameter settings.

Snapshot Isolation (SI) scheme [23] in the simulator. We chose the SI scheme for comparison since it provides the same currency guarantees as the MVCC-BOT protocol, but ensures strictly weaker data consistency for transactions. As it does not avoid all known anomalies and may thus produce histories containing phantoms or write skews, SI does not guarantee serializability. However, as this protocol avoids many anomalies and is nowadays implemented in many database products such as Oracle 10g [82, 117] or PostgreSQL [57], it is an attractive benchmark for our protocols.

Besides those comparisons, we wanted to quantify the performance impact of extending the protocols of the MVCC-* suite by some of the conflict-reducing and -resolving measures proposed in Section 6.4.3. From the three proposed approaches, we selected the one that is based on the prerequisite that clients provide alternatives for each intended write operation. We opted for the alternativity technique since there exists a wide spectrum of applications (e.g., sales, appointment, or procurement applications) where it can be applied, and it neither requires any severe adaptations of the MVCC-* suite nor does it have to be presented in the context of some application scenario. The alternativity technique has been integrated into the MVCC-* suite as follows: Whenever a write operation is performed by an active read-write transaction Ti, the user or, to be more exact, the workload generator randomly selects from a set of non-conflicting writes up to 3 alternative write operations for any of the original write actions of Ti. Those alternative operations will then be used whenever a rw-conflict between Ti and some validated read-write transaction Tj is detected. If any of those additionally provided write operations can resolve the conflict, processing of Ti continues. Otherwise, Ti is aborted as usual.
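The thesis gives no pseudocode for this step; the conflict-resolution loop described above can be sketched as follows (function and parameter names are illustrative, not taken from the thesis):

```python
def resolve_with_alternatives(write_set, alternatives, conflicting_objects):
    """Try to replace each conflicting write of a transaction Ti with one
    of its pre-specified alternative writes (up to 3 per original write).
    Returns the adjusted write set, or None if some conflict cannot be
    resolved, in which case Ti must be aborted as usual."""
    resolved = []
    for obj in write_set:
        if obj not in conflicting_objects:
            resolved.append(obj)
            continue
        # Look for an alternative write that does not itself conflict.
        replacement = next((alt for alt in alternatives.get(obj, ())
                            if alt not in conflicting_objects), None)
        if replacement is None:
            return None  # irresolvable conflict: abort Ti
        resolved.append(replacement)
    return resolved
```
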

6.5.4 Basic Experimental Results

We now present the results of the experiments run under the baseline settings of the simulator. Regarding the statistical accuracy of all subsequently illustrated performance measures, it is important to note that they lie within a 90% confidence interval with a relative error of ±5%. Figures 6.7(a) and 6.7(b) depict these results as the number of operations per transaction increases from 5 to 25.

Thereby, Figure 6.7(a) represents the throughput rate per second as a function of the transaction length for both the protocols of the MVCC-* suite and SI. Note that we do not show performance results for MVCC-IBOT and MVCC-BOT in Figure 6.7 since both protocols are inferior to their optimized variants. The results show that MVCC-IBOTO outperforms MVCC-BOTO and MVCC-EOT by about 31% and 83%, respectively, in the sense that the system performance would degrade by the specified percentage if MVCC-IBOTO were not deployed as CC protocol. Additionally, it can be seen that SI's performance is superior to that of all protocols of the MVCC-* suite.

Figure 6.7(b), in its turn, shows the relative performance difference between SI and the MVCC-* suite. On average, the performance penalty relative to SI is about 40% for the best performing MVCC-* protocol, i.e., the penalty of providing serializability to mobile transactions is significant.
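The thesis does not restate the formula behind the relative performance penalty (RPP). Under the reading suggested by the 31% and 83% figures above (the degradation that would occur if the reference protocol were not deployed), the metric can be sketched as follows; this interpretation is an assumption:

```python
def rpp(throughput_ref, throughput_x):
    """Relative performance penalty (in percent) of running protocol X
    instead of the reference protocol: by how much the system would
    degrade if the reference were replaced by X."""
    return (throughput_ref - throughput_x) / throughput_x * 100.0

# With this reading, a protocol reaching 1.0 tx/s against a reference
# running at 1.31 tx/s carries a penalty of about 31%.
```
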

The plots also show that the penalty of ensuring serializability rises with increasing transaction length since more transactions need to be restarted by the protocols of the MVCC-* suite relative to SI. It is interesting to note that the results visualized in Figure 6.7 diverge significantly from those experimentally investigated for stationary environments. In [9] the performance degradation from providing serializability (IL 3) relative to protocols with lower consistency guarantees (IL 2) was examined for a conventional client-server database system where communication is carried out through a high-bandwidth, reliable, and low-latency network. There, the experimental results show that serializability can be achieved at a performance penalty of only 1% to 9% relative to protocols that ensure IL 2 data consistency guarantees. Consequently, the author concludes that it is not worthwhile to execute transactions under weaker consistency levels than serializability since lower levels do not exclude the possibility of violating the integrity of the database. Despite the significantly higher serialization costs experienced in mobile environments, we still believe that clients (whether mobile or stationary) should not trade off data consistency for performance improvements by means of violating database integrity constraints.

[Figure 6.7 appears here: panel (a) plots Throughput / Second vs. Transaction Length (5–25) for SI, MVCC-IBOTO, MVCC-BOTO, and MVCC-EOT; panel (b) plots RPP compared to SI (Percent) vs. Transaction Length.]

Figure 6.7: Transaction throughput and relative performance penalty (RPP) of the MVCC-* suite compared to SI as the transaction size is increased.

Additionally, we conducted experiments in order to quantify the degradation of the overall system performance that occurs when using MVCC-BOT and MVCC-IBOT in lieu of their optimized companions. The results of the experiments are summarized in Figure 6.8, showing that the penalty of using the unoptimized protocols grows with increasing transaction length. On average, the performance difference of MVCC-IBOT and MVCC-BOT relative to their optimized versions is about 8%. Since the measured performance degradation due to inefficient transaction validation is quite significant, we believe that the additional processing overhead of the optimized protocols compared to their basic variants is well compensated by the performance gain.

6.5.5 Results of the Sensitivity Analysis

In the following subsections we present the results of a sensitivity analysis conducted to understand how different system and workload parameters affect the overall system performance of the protocols of the MVCC-* suite in comparison to each other and to SI. We also report on the performance impact on the protocols of the MVCC-* suite and SI when providing alternative write operations to those specified in the original transaction. Last, but not least, we present the results of a performance analysis comparing the cache replacement and prefetching policy MICP-L, used to improve response times of read-write transactions, with LRFU-P and P-P.

[Figure 6.8 appears here: RPP comp. to the Optimized Protocols (Percent) vs. Transaction Length (5–25), with curves for MVCC-IBOT and MVCC-BOT.]

Figure 6.8: Relative performance penalty of deploying MVCC-BOT and MVCC-IBOT, respectively, in lieu of their optimized variants.

6.5.5.1 Effects of Varying the Data Contention Level

To understand the impact of data contention on the protocols of the MVCC-* suite as well as the SI scheme, we varied the number of update transactions generated per MBC by the server workload generator. Remember that in the default setting of the simulator the generator produced 10 read-write transactions per MBC. For the sensitivity experiments, we varied the number of read-write transactions issued by the server from 5 up to 25 transactions per MBC. As the results in Figures 6.9(a) and 6.9(b) show, the performance of the MVCC-* schemes compared to SI degrades both in terms of absolute numbers and relative percentages. The reason is that if there are only few server transactions executing in parallel to the transactions run at the clients, the data conflict rate is relatively low and, therefore, transaction restarts are rare. However, if we increase the number of concurrently active transactions gradually, the transaction abort rate grows superlinearly, causing the overall system performance to degrade at the same rate. Besides, the MVCC-* schemes suffer from higher data contention levels to a greater extent than SI due to the relatively faster increasing probability of transaction aborts for the former.

[Figure 6.9 appears here: panel (a) plots Throughput / Second vs. Number of Update Transactions per MBC (5–25) for SI, MVCC-IBOTO, MVCC-BOTO, and MVCC-EOT; panel (b) plots RPP compared to SI (Percent).]

Figure 6.9: Absolute and relative transaction throughput by varying the number of update transactions issued per MBC.

6.5.5.2 Effects of Providing Alternative Write Operations to Transactions

The MVCC-* suite was designed to provide serializability and well-defined data currency guarantees to general-purpose applications without having to exploit any application or user knowledge for carrying out CC. In Section 6.4.3, we described that the technique of using alternatively specified write operations to prevent active read-write transactions from experiencing data conflicts with recently committed transactions has the potential of achieving a higher degree of concurrency in the system. However, the method has the fundamental disadvantage of undermining the user-friendliness of the application since it is the user who is expected to specify write alternatives for transactions. In order to be able to better evaluate the attractiveness of this technique, we ran experiments quantifying its impact on the overall system performance. For this purpose, we varied the number of alternative write operations provided for each transactional object write from 0 to 3.

The results of the experimental studies are represented in Figures 6.10(a) and 6.10(b), respectively.

They show a notable performance improvement of at least 40% for the investigated protocols if just one additional write operation is associated with each original write operation. Additionally, the plots show that providing more than one additional write operation for each original one results in a sublinear performance increase. Nonetheless, transaction throughput nearly doubles (irrespective of the investigated protocol) if each original write operation is backed up by three additional ones. The reader may wonder why we do not present experimental results for the MVCC-EOT protocol. The reason is that the scheme does not benefit from this conflict-reducing technique since rw-conflicts between a validating transaction Ti and some read-write transaction Tj ∈ Tactive(Ti) do not matter under the MVCC-EOT protocol. As the scheme enforces EOT data currency guarantees, a validating transaction Ti will always be serialized after all previously committed transactions and, therefore, rw-conflicts are not an issue.

[Figure 6.10 appears here: panel (a) plots Throughput / Second vs. Number of Alternative Write Operations (0–3) for SI, MVCC-IBOTO, and MVCC-BOTO; panel (b) plots the Improvement in Transaction Throughput (Percent).]

Figure 6.10: Absolute and relative performance improvements by providing alternative write operations.

6.5.5.3 Effects of Intermittent Connectivity

Mobile clients may either voluntarily or involuntarily get disconnected from the hybrid data delivery network. In order to determine the effect of disconnections on the overall system performance, we have run the simulator under two connectivity setups: (a) First, we simulated the case of mobile clients being partially disconnected from the network for fixed periods of time. By the notion of partial disconnection we mean that clients' networking capabilities are restricted to accessing and downloading data from the broadcast channel, i.e., if clients operate in partial disconnection mode, there is no means for them to communicate with the server through a point-to-point channel. (b) Second, we simulated the case of communication between clients and the server being completely interrupted for fixed time intervals.

The results of periodically interrupting the point-to-point communication between mobile clients and the server are presented in Figures 6.11(a) and 6.11(b), respectively. The plots show the performance degradation of the investigated protocols as a function of the disconnection probability. Remember that the disconnection probability specifies the likelihood of a client's inability to communicate over one or more communication media. As the results show, the performance penalty experienced by clients due to partial disconnections is negligible if disconnections occur relatively infrequently, i.e., up to 10% of the overall simulation time. Otherwise, the performance degrades moderately by about 8 to 14% if the point-to-point communication is interrupted for about a quarter to a half of the clients' total processing time. The reason for that relatively small performance drop due to unreliability of the back-channel to the server is twofold: (a) The majority of the client data requests can be satisfied either from the client cache or from the broadcast channel, i.e., the uplink communication channel plays only a tangential role for satisfying data requests. (b) Since transactions are long-lived due to high RTTs over wireless channels and are therefore likely to experience many data conflicts, the number of transaction pre-commits is relatively low and thus, the back-channel is only seldom required to initiate final transaction validations.
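The difference between the two connectivity setups can be illustrated with a small sketch of how a single client data request might be resolved; this is a deliberately simplified model of our own, not the simulator's code:

```python
import random

def serve_request(obj, cache, on_broadcast, mode, rng, p_disconnect):
    """Resolve one client data request under intermittent connectivity.
    mode "partial": the broadcast channel stays reachable while the
    point-to-point uplink is down; mode "total": neither channel is
    reachable during a disconnection."""
    disconnected = rng.random() < p_disconnect
    if obj in cache:
        return "cache hit"
    if obj in on_broadcast and not (disconnected and mode == "total"):
        return "broadcast"      # fetched from the air-cache
    if not disconnected:
        return "uplink fetch"   # point-to-point request to the server
    return "blocked"            # processing stalls until reconnection

rng = random.Random(0)
# Under partial disconnection the broadcast channel still serves the miss.
print(serve_request(42, set(), {42}, "partial", rng, 1.0))
```

Under total disconnection the same cache miss would return "blocked", which mirrors the blocking behavior discussed for Figure 6.12 below.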

[Figure 6.11 appears here: panel (a) plots Throughput / Second vs. Disconnection Probability (0–0.5) for SI, MVCC-IBOTO, MVCC-BOTO, and MVCC-EOT; panel (b) plots the RPP of Intermittent Connectivity (Percent).]

Figure 6.11: Absolute and relative performance degradation with increasing disconnection probability (partial disconnection).

The results of the experiments simulating the scenario where the clients are detached from both communication channels while being in disconnected mode are shown in Figures 6.12(a) and 6.12(b). As before, increasing the disconnection probability causes the transaction throughput rate to drop. However, in contrast to the previous experiments, the decline in the overall system performance is significant even in cases of rare disconnections. For example, if the probability of a client being separated from the mobile network is 10%, the system performance degrades by 24 to 42%, depending on the protocol used. Further, the plots show that the performance decline accelerates with increasing disconnection probability, which is explained by the fact that clients operating in total disconnection mode may miss CCRs desirable to ensure cache consistency and freshness and to pre-validate active read-write transactions. Another major drawback of total disconnections is that they hinder clients from acquiring non-cache-resident object versions, i.e., whenever a cache miss occurs, transaction processing is blocked until the client reconnects to the server. Depending on the frequency and duration of disconnection periods, transaction processing is impeded significantly, as Figure 6.12 clearly demonstrates.

[Figure 6.12 appears here: panel (a) plots Throughput / Second vs. Disconnection Probability (0–0.5) for SI, MVCC-IBOTO, MVCC-BOTO, and MVCC-EOT; panel (b) plots the RPP of Intermittent Connectivity (Percent).]

Figure 6.12: Absolute and relative performance degradation with increasing disconnection probability (total disconnection).

6.5.5.4 Effects of Using Various Caching and Prefetching Policies

Our last experiment was aimed at proving the performance superiority of MICP-L over LRFU-P, which is a modified variant of the LRFU cache replacement policy [98, 99], used in order to improve the response times of read-write transactions. Remember, MICP-L's original goal was to improve transaction response times of read-only transactions. In Chapter 5 we showed that the performance degradation of using LRFU-P is about 19% vs. MICP-L when used to meet the data storage requirements of read-only transactions. Since the CC protocols proposed for read-only transactions [123, 124, 138, 139] place similar requirements on the client cache manager as the MVCC-* schemes2, we expect MICP-L to outperform LRFU-P if used along with the MVCC-* suite.

In the following we present the experimental results combining the MVCC-IBOTO protocol with various cache replacement and prefetching policies, namely P-P, LRFU-P, and MICP-L. We chose MVCC-IBOTO as CC protocol for these experiments since it turned out to be the best performing protocol among those of the MVCC-* suite. The reason for selecting P-P in addition to LRFU-P and MICP-L is that the P-P policy's characteristics, such as perfect knowledge of the access probabilities of all database objects and ease of implementation, allow us to use its experimental results as a baseline for the comparison with the other two policies. Further, to be able to compare the results with those measured for read-only transactions in Chapter 5, we ran our simulator with the same system and workload settings as used there. To do so, we increased the number of data updates produced by the server workload generator from 50 to 100 per MBC. The results of the experiments are graphically shown in Figures 6.13(a) and 6.13(b). As in our previously reported experiments (see Section 5.4.4), P-P significantly outperforms both online cache replacement and prefetching policies MICP-L and LRFU-P. However, and more importantly, MICP-L is also superior to LRFU-P when used to accelerate the response times of read-write transactions. On average, the relative performance degradation when deploying LRFU-P in lieu of MICP-L is about 6%. The drop in MICP-L's performance advantage over LRFU-P when used for read-write rather than read-only transactions is due to the fact that read-write transactions do not benefit to the same extent as their read-only counterparts from observing non-current data objects. Since MICP-L and LRFU-P primarily differ from each other in the way they handle those object versions, MICP-L's superiority over LRFU-P diminishes somewhat. In order to enable the reader to compare the results more easily, we adjusted the scaling of the plots presented in Section 5.4.4 and present them once more in Figures 6.14(a) and 6.14(b).

2 To achieve an optimal level of concurrency all aforementioned protocols exploit multi-versioning and, therefore, expect the cache manager to maintain non-current as well as current object versions in the client cache.

[Figure 6.13 appears here: panel (a) plots Throughput / Second vs. Transaction Length (5–25) for P-P, MICP-L, and LRFU-P; panel (b) plots RPP to P-P (Percent).]

Figure 6.13: Absolute and relative performance deviation between P-P, MICP-L, and LRFU-P when varying the read-write transaction size.

[Figure 6.14 appears here: panel (a) plots Throughput / Second vs. Transaction Length (5–25) for P-P, MICP-L, and LRFU-P; panel (b) plots RPP to P-P (Percent).]

Figure 6.14: Absolute and relative performance deviation between P-P, MICP-L, and LRFU-P when varying the read-only transaction size.

In the preceding sections of this chapter we have given a detailed description of a suite of MVCC protocols ideally applicable in hybrid data delivery environments. Besides, we have provided the correctness proofs of the protocols, have shown their performance under various system settings and workloads, have evaluated them against the SI protocol, have given an indication of how their performance could be improved by exploiting the semantics available from the user and/or application, and finally have investigated MICP-L's performance when servicing clients that execute read-write transactions. The quantitative performance analysis has shown that MVCC-IBOTO is the best performing protocol in the MVCC-* suite. Performance-wise, MVCC-IBOTO is followed by MVCC-BOTO and MVCC-EOT, which indicates the MVCC protocols' superiority over mono-version schemes in mobile environments. Additionally, comparing and contrasting the performance results of MVCC-IBOTO and MVCC-BOTO demonstrates that forcing read-write transactions to read object versions that were current at the transaction's starting point is not the optimal strategy when application responsiveness is to be maximized. Instead, when mapping read operations to actual version reads, the read forward policy should be applied, which allows reads of object versions written after the starting point of a read-write transaction Ti as long as Ti has not been invalidated by any concurrent read-write transaction Tj. It is also important to note that despite the underperformance of the other protocols of the MVCC-* suite compared to MVCC-IBOTO, their existence is fully justified. Since both MVCC-BOTO and MVCC-EOT provide data currency guarantees to read-write transactions different from those of MVCC-IBOTO, and each of those degrees may be desirable in one or the other application scenario, all these protocols are useful for CC purposes.

To provide some assistance when selecting the most appropriate CC scheme out of the MVCC-* suite, we summarized the protocols' major characteristics in Table 6.5, which describes the protocols' features from the perspective of a client processing a read-write transaction Ti. In summary, the information presented in Section 6.5 shows that MVCC-EOT is the cheapest protocol in the MVCC-* suite in terms of both space and processing overhead. However, it clearly suffers from its weak performance results and, therefore, should only be used in situations where EOT data currency guarantees are necessary for correctness reasons. Overhead-wise, MVCC-EOT is followed by MVCC-IBOT and its optimized variant MVCC-IBOTO. In contrast to MVCC-EOT, however, MVCC-IBOT and MVCC-IBOTO outperform the latter significantly. As MVCC-IBOT and MVCC-IBOTO perform better than MVCC-BOT and its optimized variant, and as the latter pair incurs the highest storage and processing costs, MVCC-IBOT and MVCC-IBOTO are the first choice if the overall system performance is to be maximized. MVCC-BOT or MVCC-BOTO may be applied if BOT data currency requirements are imperative from the application point of view.

Besides, if the application response time provided by any of the proposed protocols is not satisfactory from the user's point of view, the protocols can be extended by a number of measures such as associating dependency relations with transactions, providing alternative write operations to those specified in the original transaction, etc. The implication of exploiting such techniques for transaction processing has been quantified through simulation, and the results have clearly proven their attractiveness for mobile computing. However, when deploying such measures, one always has to remember that they are application-dependent, that their use is prone to error, and that they complicate application programming and may deteriorate the application's user-friendliness.

For each protocol, the table lists the storage space overhead at the client, the processing overhead at the client, the data currency guarantee, the influence of disconnections, and the performance penalty relative to MVCC-IBOTO.

MVCC-BOT — storage space overhead: moderate; processing overhead: moderate; data currency guarantee: database state as of the transaction's starting point; influence of disconnections: no influence on protocol correctness, but on transaction throughput; performance penalty relative to MVCC-IBOTO: 36%.

MVCC-BOTO — storage space overhead: moderate, but higher than MVCC-BOT; processing overhead: moderate; data currency guarantee: see MVCC-BOT protocol; influence of disconnections: see MVCC-BOT protocol; performance penalty: 31%.

MVCC-IBOT — storage space overhead: low if RFF(Ti) is set to true, otherwise moderate; processing overhead: moderate; data currency guarantee: database state between the transaction's starting and commit point; influence of disconnections: if RFF(Ti) is set to false, same influence as on MVCC-BOT, otherwise it requires Ti to end its RFP; performance penalty: 8%.

MVCC-IBOTO — storage space overhead: low if RFF(Ti) is set to true, otherwise moderate; processing overhead: moderate; data currency guarantee: see MVCC-IBOT protocol; influence of disconnections: see MVCC-IBOT protocol; performance penalty: –.

MVCC-EOT — storage space overhead: low; processing overhead: low; data currency guarantee: database state as of the transaction's commit point; influence of disconnections: see MVCC-BOT protocol; performance penalty: 83%.

Table 6.5: The MVCC-* suite at a glance.

"All truths are easy to understand once they are discovered; the point is to discover them."

– Galileo Galilei

Chapter 7

Conclusion and Future Work

In this thesis, we have focused on the problem of efficiently providing consistent and current data to dissemination-based applications run at clients that are part of a hybrid data delivery network. In this last chapter, we summarize the results presented. We then conclude with a discussion of various directions in which this work can be extended.

7.1 Summary and Conclusion

Owing to the widespread deployment of wireless networks and the ever-increasing capabilities of mobile devices, wireless data services are quickly emerging as data-hungry users require instant access to timely information no matter where they are located [71, 79, 108]. Due to the intrinsic constraints of mobile systems such as asymmetric bandwidth, limited power supply, and unreliable communication, the efficient and cost-effective provision of wireless data services poses many research challenges in itself. One of the most important issues is to provide data consistency and currency to dissemination-based applications, and this topic has been intensively discussed within this thesis.

The thesis first presented background information on the basic concepts of wireless data communications, highlighted the characteristics, capabilities, and limitations of existing and newly emerging wireless communication networks, and discussed the various forms of asymmetry that occur in mobile data networks. We then enumerated the various limitations of mobile computing and discussed their influence on our objective to efficiently provide data consistency and currency to information-centered applications despite frequent updates of the data source. Thereafter, we proposed hybrid data delivery as the basis for providing highly scalable and efficient transaction support to dissemination-based applications and presented its potential performance and scalability benefits in contrast to its underlying basic data delivery mechanisms, namely the traditional request/response (or pull/unicast) and the rather novel push/broadcast. This was followed by a discussion of various performance-critical and other crucial issues — besides transaction support — that are vital to the successful deployment of hybrid data delivery services. In this context, we focused on the air-cache, which serves as an abstract vehicle or intermediate memory level between the mobile clients and the server, identified its special properties compared to other types of caching, presented different ways to organize the air-cache, and discussed their advantages and disadvantages w.r.t. the critical issue of providing access efficiency. We identified power conservation as a second system-critical design component for hybrid data delivery networks and introduced air-cache indexing as a solution to the problem of reducing the energy consumption of mobile devices when locating and retrieving requested data objects in the air-cache. We distinguished three classes of indexing: (a) signature-based, (b) hashing-based, and (c) tree-based indexing, described their basic working principles, and reported on the results of two comparison studies that quantitatively evaluated the performance of various instances of the three classes.
As a result of the trade-off between tuning time and access latency (and as reported by the performance studies), none of the three indexing methods is superior to the others in terms of both performance metrics. The results, however, showed that if the application scenario favors short latency at the cost of more energy consumption, a signature-based or hashing-based indexing method is the way to go. Otherwise, a tree-based indexing method should be deployed.

Following those preliminary discussions, the thesis then focused on the main problem of this work, the cost-efficient and adequate provision of data consistency and data currency to dissemination-based applications. We first concentrated on the provision of efficient and reliable data consistency and data currency support for queries, i.e., read-only transactions, as they constitute the majority of the transactions initiated by dissemination-based applications. We addressed the limitations of existing IL definitions by showing that most of them lack any data currency guarantees. To rectify the problem, we proposed four new ILs, namely BOT Serializability, Strict Forward BOT Serializability, Strict Forward BOT Update Serializability, and Strict Forward BOT View Consistency, that provide a set of useful data consistency and currency guarantees to dissemination-based applications. In contrast to the ANSI ILs, our specifications of the proposed levels are implementation-independent and use a combination of conditions on serialization graphs and transaction histories. Furthermore, we presented new and efficient implementations of the newly defined ILs based on optimism and multi-versioning. We also presented the results of a simulation study that evaluated the relative performance of the different ILs' implementations and additionally compared their performance with previously proposed schemes. The results showed that the cost of providing Full Serializability to read-only transactions compared to View Consistency, which is the weakest consistency level that ensures a transaction-consistent view of the database, is relatively low, ranging from as little as 1% up to 10%. Thus, if the application writer is in doubt whether running a read-only transaction at a weaker IL would produce anomalous reads that may result in false or misleading decisions, then serializability is preferable over any other level. Further, we conducted a comparison study of our worst and best performing ILs' implementations, namely MVCC-SFBVC and MVCC-BS, with the invalidation-only and F-MATRIX-No schemes [124, 139]. The results showed that MVCC-SFBVC and MVCC-BS are both superior to the other two protocols, which is a result of the strong data currency guarantees that the latter two enforce, obliging the scheduler to produce only mono-version histories.

As a second major topic we tackled the issues of client cache management and data prefetching, as they are fundamental techniques to improve the performance and scalability of hybrid data delivery systems. We started by emphasizing that currently available client caching and prefetching policies, whether designed for the stationary or for the mobile client-server architecture, are not effective in supporting the data preservation and storage requirements imposed by MVCC protocols suitable for read-only transactions. The reason for their inadequacy is that they treat all versions of an object the same way, ignoring the fact that two distinct versions of the same object may have different values to the client. To address this shortcoming, we proposed a novel multi-version integrated cache replacement and prefetching algorithm, called MICP. MICP logically divides the available client cache size into two variable-sized partitions, coined REC and NON-REC, in order to separate re-cacheable from non-re-cacheable object versions, to prevent re-cacheable and non-re-cacheable versions from competing with each other for scarce storage space, and to ensure that non-re-cacheable versions are not replaced by re-cacheable ones. In contrast to traditional caching and prefetching policies, MICP considers not only the access probability of an object to determine whether it should be evicted from or prefetched into the client cache, but also its re-acquisition costs and the likelihood that it can be re-cached. MICP combines estimates of those three parameters into a single performance metric, called probabilistic cost/benefit value (PCB), and calculates it for any cache-resident object version whenever a demand-fetched or prefetched object version is to be brought into a full cache. Then, an existing cached object version must be chosen as a replacement victim, and MICP does so by selecting the cached version with the lowest PCB value. As PCB values are dynamic — they change with every "tick" of the broadcast — prefetching from the air-cache is potentially very expensive to implement. MICP solves this problem by calculating PCB values only for a small subset of the potential prefetching candidates, namely versions of recently referenced objects, and makes a prefetching decision only if a version of a recently referenced object is broadcast. To gain insight into the algorithm's performance, we validated MICP or, more precisely, MICP-L, a lightweight version of MICP that calculates PCB values of cached object versions only at pre-defined events rather than at every broadcast tick, with experimental results drawn from a highly detailed simulation model. The results demonstrate that the performance penalty of using LRFU-P or W²R-P as the cache replacement and prefetching policy compared to MICP-L is, on average, about 19% and 80%, respectively, and MICP-L's cache hit rate is about 6% and 94% higher than that of LRFU-P and W²R-P, respectively. As MICP-L significantly outperforms LRFU-P, which extends LRFU by prefetching from the air-cache, and LRFU is currently known to be the best performing online caching algorithm, we can deduce that MICP-L is able to give dissemination-based applications issuing read-only transactions a much higher improvement in response time than any other proposed caching and prefetching policy.
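The PCB-based victim selection just described can be sketched as follows. The concrete combination of the three estimates below is a hypothetical stand-in for illustration; the thesis chapters define MICP's actual formula and its handling of the REC/NON-REC partitions.

```python
from dataclasses import dataclass

@dataclass
class CachedVersion:
    oid: str
    version: int
    access_prob: float    # estimated probability of a future access
    reacq_cost: float     # expected cost (broadcast ticks) of re-fetching it
    recache_prob: float   # likelihood the version can be re-cached later

def pcb(v: CachedVersion) -> float:
    """Probabilistic cost/benefit value: versions that are likely to be
    accessed, expensive to re-acquire, and unlikely to reappear in the
    air-cache should be kept, i.e., score high (illustrative formula)."""
    return v.access_prob * v.reacq_cost * (1.0 - v.recache_prob)

def choose_victim(cache: list[CachedVersion]) -> CachedVersion:
    """MICP-style replacement: evict the version with the lowest PCB."""
    return min(cache, key=pcb)
```

Under this scoring, a rarely accessed version that is cheap to re-acquire and almost certain to be rebroadcast is evicted first, which is exactly the behavior that distinguishes MICP from access-probability-only policies such as LRFU.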

Last, but not least, we addressed the issue of providing data consistency and currency along with good performance to broadcast-based applications that not only want to access and consume shared information, but also want to produce it and integrate it into a universally accessible database. We first showed that currently available ILs and CC protocols are not suitable for broadcast-based applications issuing read-write transactions since they either lack any data currency guarantees or do not ensure serializability. To rectify the problem, we proposed a suite of five new MVCC protocols, dubbed MVCC-BOT, MVCC-BOTO, MVCC-IBOT, MVCC-IBOTO, and MVCC-EOT, which all ensure serializability along with BOT, IBOT, and EOT data currency, respectively. For each of those protocols, we first specified their intended semantic guarantees, then defined rules and conditions that need to be applied and satisfied by the scheduler in order to enforce those guarantees, and finally showed that the protocols produce only correct histories in accordance with their specifications. We also discussed various issues influencing the performance of the MVCC-* suite, including caching, intermittent connectivity, and exploiting semantic information about the read-write transactions being processed. We then argued that our integrated caching and prefetching policy MICP also suits the data preservation and storage requirements of the MVCC-* family and should therefore be deployed at any mobile client that requires transaction support. We further discussed the impact of network disconnections and network failures on the performance and operations of the MVCC-* protocols. We identified the MVCC-IBOT and MVCC-IBOTO protocols as being vulnerable to disconnections and proposed two alternative ways to rectify the problem. Thereafter, we showed how semantic knowledge about the objects and the operations that operate on them can be used to identify false or permissive conflicts that occurred by using the MVCC-* suite and how real conflicts can be resolved by means of specifying alternative write choices as part of their original operations. Again, we concluded the chapter by presenting the results of a set of simulation-based studies that investigated the relative performance differences of the protocols of the MVCC-* suite and compared their performance with the well-known Snapshot Isolation scheme, which provides slightly weaker consistency guarantees than our protocols, hence allowing us to study the performance trade-off between serializability and a weaker consistency level in mobile broadcast-based environments. The results showed that MVCC-IBOTO is the best performing protocol of the MVCC-* suite, followed by MVCC-BOTO and MVCC-EOT. Further, the experiments revealed that the cost of providing Full Serializability instead of the weaker Snapshot Isolation guarantees to read-write transactions is, on average, about 40% in mobile networks, which is significantly higher than the cost imposed on clients in stationary networks [9]. We also presented results of a detailed sensitivity analysis which was conducted to understand how different system and workload parameters, including the data contention level, the provision of alternative write operations, network connectivity, and the caching and prefetching policy, affect the relative performance of the protocols of the MVCC-* suite and the Snapshot Isolation scheme. The results can be summarized as follows: (a) Increasing the number of read-write transactions in the workload increases the data contention in the system and thus widens the throughput difference between the protocols of the MVCC-* suite and the Snapshot Isolation protocol. (b) Specifying alternative write operations in addition to the originally scheduled updates leads to a significant performance improvement of at least 40% for the investigated protocols. (c) If the client suffers from intermittent connectivity to the server, throughput decreases only moderately as long as the disconnections are short and constrained to the back-channel to the server. Otherwise, if clients are completely disconnected from the hybrid network, the performance of the investigated protocols may degrade significantly even under relatively low disconnection probabilities. (d) Using MICP-L rather than LRFU-P as the client caching and prefetching policy improves the system performance by about 6% on average, making it the first choice for multi-version client-server environments.

7.2 Future Work

We believe that the results of the thesis have far-reaching implications, as they provide system developers with guidance in deploying large-scale information dissemination systems whose task is to deliver consistent and timely data to read-only and read-write transactions executing at mobile clients. There is a lot of interesting research work to be done in the future, and some possible directions are highlighted below:

• The thesis has shown that the cost of providing serializability to read-only transactions compared to weaker consistency guarantees including Update Serializability and View Consistency is not excessively high (≤ 10%) in broadcast-based data delivery networks. However, the cost of providing strong consistency guarantees such as Full Serializability to read-write transactions compared to slightly weaker consistency conditions such as Snapshot Isolation is significantly higher (approximately 40%). Thus, it would be beneficial to execute dissemination-based applications that not only inspect, but also modify the database state below serializability. However, and as mentioned in previous parts of the thesis, an important drawback of consistency guarantees weaker than serializability is that applications may destroy database integrity if the application programmer does not analyze the program code for potential conflicts with other transactions and prevent them from occurring. Analyzing and detecting viable, i.e., non-interfering, interleavings of the execution of transactions at ILs weaker than serializability is a non-trivial and error-prone task and should therefore be supported by a tool providing the programmer with an intuitive interface to perform the analysis semi-automatically. The development of a preferably graph-based tool which partially automates the analysis process and visually supports the conflict detection and avoidance task, thus taking a great burden off the application programmer, is a particularly important and challenging research issue currently under investigation in two independent research groups [50, 104]. We believe that the existence of such a tool would certainly contribute to the further promotion of semantics-based CC mechanisms in academia and, more importantly, in the commercial world.

• This thesis presented various MVCC protocols that provide efficient and scalable transaction support for dissemination-based applications issuing both read-only and read-write transactions. Our protocols have been presented under the assumption that the sizes of the objects being disseminated are relatively small (in the range from a few bytes to several dozen bytes). While the size of the objects is relatively unimportant for read-only transactions as long as it is small relative to the cache size, this is certainly not true for read-write transactions. In contrast to read-only transactions, which can immediately be committed by the clients once the last read operation has been processed, read-write transactions require the clients to communicate with the server to finally validate and, if successful, commit them. To enable the server to validate a committing read-write transaction Ti against previously committed read-write transactions, the client maintains various pieces of information about the data being accessed and written by Ti (e.g., Ti's read and write set) and sends them to the server along with copies of the objects modified by Ti. Obviously, if the copies of modified objects are relatively small in size compared to Ti's validation information, their transmission through the wireless medium does not take much additional time and, therefore, Ti's final validation can be performed without much delay. However, if updated objects are large and thus require a large amount of time to be transferred to the server, the probability of the transaction not being successfully validated increases as the effective degree of transaction concurrency and data contention grows. In such a situation a more appropriate strategy might be to propagate the operations (represented as plain text, e.g., SQL, as compiled code, or as calls to stored procedures) that modify the objects to the server rather than the modified objects themselves. Clearly, using such a function-shipping approach to integrate the transaction updates into the common database state does not come for free, since it incurs additional load on the server to re-execute the operations once again, i.e., the approach trades off reduced network communication costs for an increase in the CPU costs of the server. In this respect it would be an interesting research topic to investigate under which system and workload conditions either a data-shipping or a function-shipping approach is superior to the other when used to transfer client updates to the broadcast server.
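A first-cut cost model for this research question might look as follows. Every parameter value and the linear load factor are hypothetical; an actual study would calibrate them against measured uplink bandwidth and server CPU costs.

```python
def shipping_mode(update_bytes: int, op_bytes: int, uplink_bps: float,
                  reexec_secs: float, server_load: float) -> str:
    """Compare the uplink transmission time of the modified objects
    (data-shipping) with the cost of sending only the operations and
    re-executing them at the (possibly loaded) server (function-shipping)."""
    data_cost = update_bytes * 8 / uplink_bps
    func_cost = op_bytes * 8 / uplink_bps + reexec_secs * (1 + server_load)
    return "data-shipping" if data_cost <= func_cost else "function-shipping"
```

Even this crude model reproduces the intuition from the text: small updates favor shipping the objects themselves, whereas bulky updates sent over a slow wireless uplink favor shipping the SQL statements or stored-procedure calls instead.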

• In this thesis, we have proposed a suite of new MVCC protocols, called MVCC-*, that are based on optimism, allowing clients to immediately execute read-write transactions using the information stored in the client cache or air-cache without any extra communication with the broadcast server in case an object has to be accessed or modified. This is clearly in contrast to lock-based schemes that may require such communication to obtain appropriate permission (e.g., read or write permission) before an object can be accessed or modified. While the independence of client read and write operations from the server activity is useful to overcome the latency and bandwidth problems prevalent in mobile environments, at the end of a transaction Ti, however, the broadcast server needs to check Ti's validation information against the CC information of earlier committed transactions with the objective to detect non-serializable transaction executions. Whenever Ti is identified to be involved in a consistency-violating conflict with one or more recently committed read-write transactions, the detected conflict needs to be resolved somehow. Like the majority of transaction control protocols proposed in the literature, our MVCC-* protocols resolve data conflicts by simply aborting the conflicting transaction. Using this approach has the advantage that no human intervention is required to resolve data conflicts, i.e., no application-specific conflict detection and resolution rules need to be specified by the application programmer and/or users. On the other hand, the abort-based conflict resolution approach suffers from poor resource utilization and has the drawback of potentially causing transactions to starve, which is the case when a transaction repeatedly fails to commit. As starvation is a well-known problem for optimistic CC schemes, numerous solutions have been proposed in the literature. The pioneering paper on optimistic schemes by Kung et al. [96] suggests detecting a starving transaction by counting the number of successive transaction aborts. To rescue the system from starving transactions, the authors propose to use semaphores to effectively lock the entire database during the transaction's restart to ensure that the re-execution succeeds. Rahm et al. [131] proposed to solve the problem by using page locks rather than database locks, acquired after a read-write transaction is aborted for the very first time, to avoid multiple transaction restarts. More recently, and within the context of mobile databases and the presence of intermittent connectivity, Preguiça et al. [126] suggested using object type- and operation-specific reservations, i.e., locks, that are associated with time leases which guarantee that reservations will not be held forever, even if the mobile client that holds the reservation becomes permanently disconnected. An interesting area of future work is to use simulation to investigate the impact of the various approaches to the starvation problem on the overall system performance. The solution space to be examined comprises two components: (a) starvation detection and (b) starvation resolution. For starvation detection, it would be worthwhile to evaluate different heuristics useful to indicate starvation, such as the number of transaction restarts (in the range of 1 to some upper bound N), the transaction age, or the amount of wasted resources (e.g., battery power), w.r.t. their impact on the overall system performance (under various workload conditions) in order to gain insights into the pros and cons of the individual approaches. For starvation resolution of a starving read-write transaction Ti, the following questions arise: What objects should be protected from being updated during Ti's restart execution? Should the server protect all objects accessed and updated during Ti's last unsuccessful execution or only those that produced conflicts? What level of lock granularity should be used to guarantee that no conflict will arise when Ti is re-executed? If the restart change probability is 0%, i.e., Ti accesses and updates the same set of objects during its re-execution, then object-level or even object field-level locking would be appropriate; if Ti performs new accesses and updates, then page-level or even table-level locking should be used. What lease time should be attached to locks and what is the impact of mispredicting the optimal lease times (too short or too long) on the system performance?
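The detection half of this solution space, i.e., Kung-style counting of successive aborts, together with the lock-granularity question can be sketched like this. The abort threshold and the granularity rule below are illustrative assumptions, not prescriptions from the cited papers.

```python
from collections import defaultdict

class StarvationMonitor:
    """Flags a transaction as starving once it has been aborted `limit`
    times in a row (cf. the abort-counting heuristic of Kung et al.)."""
    def __init__(self, limit: int = 3):   # threshold N is an assumption
        self.limit = limit
        self.aborts = defaultdict(int)

    def committed(self, tid: str) -> None:
        self.aborts.pop(tid, None)        # a successful commit resets the count

    def aborted(self, tid: str) -> bool:
        self.aborts[tid] += 1
        return self.aborts[tid] >= self.limit   # True -> treat as starving

def lock_granularity(restart_change_prob: float) -> str:
    """Granularity rule discussed above: object-level locks suffice if the
    restarted transaction re-accesses exactly the same objects; coarser
    page-/table-level locks hedge against new accesses during the restart."""
    return "object" if restart_change_prob == 0.0 else "table"
```

A simulation study along the lines proposed above would vary the threshold, the heuristic (restart count vs. transaction age vs. wasted resources), and the granularity rule, and measure their effect on throughput and abort rates.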

• In this thesis all performance evaluations have been carried out using a synthetic workload model to generate a stream of read and write requests issued by the client and server transactions. In light of the industry's growing interest in wireless data services and the emergence of a continuous broadcast network, called DirectBand Network [108], which uses FM radio sub-carrier frequencies to deliver timely information to people with SPOT-enabled devices (e.g., PDAs or watches), it would be useful to re-validate the performance of our algorithms by experimenting with real workload traces gathered from a large number of clients using DirectBand Network services. By re-running the experiments with real traces and comparing the gathered results with those of our experiments, it would be interesting to see whether the choice of the workload model (synthetic vs. real) affects the relative performance and overall ranking of the studied CC protocols and cache replacement and prefetching policies. Should the experiments reveal differences in the performance of the algorithms, it would then be important to investigate the cause of this inconsistent behavior in order to draw conclusions for future studies in this area.

Bibliography

[1] R. Abbott and H. Garcia-Molina, “Scheduling Real-Time Transactions: A Performance Evaluation,” in VLDB 1988, 1988, pp. 1–12.

[2] N. Abramson, “The ALOHA System — Another Alternative for Computer Communication,” in Proc. of the Fall Joint Computer Conference, 1970, pp. 281–285.

[3] S. Acharya, “Broadcast Disks: Dissemination-based Data Management for Asymmetric Communication Environments,” Brown University, Department of Computer Science, Tech. Rep. CS-97-15, 1997.

[4] S. Acharya, R. Alonso, M. Franklin, and S. Zdonik, “Broadcast Disks: Data Management for Asymmetric Communications Environments,” in Proc. ACM SIGMOD Conf., 1995, pp. 199–210.

[5] S. Acharya, M. Franklin, and S. Zdonik, “Prefetching from a Broadcast Disk,” in ICDE 1996, February 1996, pp. 276–285.

[6] ——, “Balancing Push and Pull for Data Broadcast,” in Proc. ACM SIGMOD Conf., 1997, pp. 183–194.

[7] S. Acharya, M. J. Franklin, and S. B. Zdonik, “Dissemination-Based Data Delivery Using Broadcast Disks,” IEEE Personal Communications, vol. 2, no. 6, pp. 50–60, December 1995.

[8] F. Adachi, “Fundamentals of Multiple Access Techniques,” in Wireless Communications in the 21st Century, M. Shafi, S. Ogose, and T. Hattori, Eds. Wiley-IEEE Press, 2002.

[9] A. Adya, “Weak Consistency: A Generalized Theory and Optimistic Implementations for Distributed Transactions,” MIT Laboratory for Computer Science, Cambridge, MA, Tech. Rep. MIT/LCS/TR-786, March 1999.

[10] A. Adya, B. Liskov, and P. O’Neil, “Generalized Isolation Level Definitions,” in ICDE 2000, 2000, pp. 67–78.

[11] D. Agrawal, A. E. Abbadi, and A. K. Singh, “Consistency and Orderability: Semantics-Based Correctness Criteria for Databases,” ACM TODS, vol. 18, no. 3, pp. 460–486, 1993.

[12] AirTV, Inc., “The Official AirTV Website,” 2004. [Online]. Available: http://www.airtv.net

[13] R. Alonso and H. F. Korth, “Database System Issues in Nomadic Computing,” in Proc. ACM SIGMOD Conf. 1993, 1993, pp. 388–392.

[14] ANSI X3.135-1992 — Database Language SQL, American National Standard for Information Systems, 1819 L Street, NW, Washington, DC 20036, USA, 1992.

[15] M. H. Ammar and J. Wong, “The Design of Teletext Broadcast Cycles,” Performance Evaluation, vol. 5, no. 4, pp. 235–242, 1985.

[16] Anonymous, “Wireless LANs: Comparison of Wireless LAN Standards — 802.11a versus 802.11b,” 2001. [Online]. Available: http://www.mobileinfo.com/Wireless LANs/802.11a 802.11b.htm

[17] H. Balakrishnan and V. N. Padmanabhan, “How Network Asymmetry Affects TCP,” IEEE Communications Magazine, vol. 39, no. 4, pp. 60–66, April 2001.

[18] H. Balakrishnan, V. N. Padmanabhan, G. Fairhurst, and M. Sooriyabandara, “TCP Performance Implications of Network Path Asymmetry,” 2002. [Online]. Available: ftp://ftp.isi.edu/in-notes/rfc3449.txt

[19] K. Banh, “Kenny’s PDA (Personal Digital Assistant) Guide,” 2004. [Online]. Available: http://www.pages.drexel.edu/~kvb22/

[20] D. Barbará and T. Imielinski, “Sleepers and Workaholics: Caching Strategies in Mobile Environments,” in Proc. ACM SIGMOD Conf., 1994, pp. 1–12.

[21] D. Barbará and T. Imielinski, “Sleepers and Workaholics: Caching Strategies in Mobile Environments,” VLDB Journal, vol. 4, no. 4, pp. 567–602, 1995.

[22] R. Bayer and E. McCreight, “Organization and Maintenance of Large Ordered Indexes,” Acta Informatica, vol. 1, pp. 173–189, 1972.

[23] H. Berenson, P. Bernstein, J. Gray, J. Melton, E. O’Neil, and P. O’Neil, “A Critique of ANSI SQL Isolation Levels,” in Proc. ACM SIGMOD Conf., June 1995, pp. 1–10.

[24] A. Bernstein, D. Gerstl, and P. Lewis, “Concurrency Control for Step-Decomposed Transactions,” Information Systems, vol. 24, no. 8, 1999.

[25] A. J. Bernstein, D. S. Gerstl, W. H. Leung, and P. M. Lewis, “Design and Performance of an Assertional Concurrency Control System,” in ICDE 1998, 1998, pp. 436–445.

[26] P. A. Bernstein, V. Hadzilacos, and N. Goodman, Concurrency Control and Recovery in Database Systems. Addison-Wesley, 1987.

[27] Bluetooth Special Interest Group, “Bluetooth Core Specification v1.2,” Nov. 2003. [Online]. Available: https://www.bluetooth.org/spec

[28] ——, “The Official Bluetooth SIG Website,” 2004. [Online]. Available: http://www.bluetooth.com

[29] P. M. Bober and M. J. Carey, “Multiversion Query Locking,” in VLDB 1992, August 1992, pp. 497–510.

[30] T. Bowen, G. Gopal, G. Herman, T. Hickey, K. Lee, W. Mansfield, J. Raitz, and A. Weinrib, “The Datacycle Architecture,” CACM, vol. 35, no. 12, pp. 71–81, 1992.

[31] L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker, “Web Caching and Zipf-like Distributions: Evidence and Implications,” in Infocom 1999, 1999, pp. 126–134.

[32] P. Cao, E. W. Felten, A. Karlin, and K. Li, “A Study of Integrated Prefetching and Caching Strategies,” in ACM SIGMETRICS 1995, 1995, pp. 188–197.

[33] E. Chang and R. Katz, “Exploiting Inheritance and Structure Semantics for Effective Clustering and Buffering in an Object-Oriented DBMS,” in Proc. ACM SIGMOD Conf., 1989, pp. 348–357.

[34] W. W. Chang and H. J. Schek, “A Signature Access Method for the Starburst Database System,” in VLDB 1989, 1989, pp. 145–153.

[35] M. S. Chen, K. L. Wu, and P. S. Yu, “Optimizing Index Allocation for Sequential Data Broadcasting in Wireless Mobile Computing,” IEEE TKDE, vol. 15, no. 1, pp. 161–173, 2003.

[36] C. M. Cordeiro, H. Gossain, R. L. Ashok, and D. P. Agrawal, “The Last Mile: Wireless Technologies for Broadband and Home Networks,” in 21st Brazilian Symposium on Computer Networks, 2003, pp. 119–178.

[37] M. Dankberg and J. Puetz, “Comparative Approaches in the Economics of Broadband Satellite Services,” January 2002. [Online]. Available: http://www.satelliteonthenet.co.uk/white/viasat1.html

[38] A. Datta, A. Celik, J. G. Kim, D. E. VanderMeer, and V. Kumar, “Adaptive Broadcast Protocols to Support Power Conservant Retrieval by Mobile Users,” in ICDE 1997, 1997, pp. 124–133.

[39] S. B. Davidson, H. Garcia-Molina, and D. Skeen, “Consistency in a Partitioned Network: A Survey,” ACM Comput. Surv., vol. 17, no. 3, pp. 341–370, 1985.

[40] P. J. Denning, “On Modeling Program Behaviour,” in Proceedings Spring Joint Computer Conference, Arlington, VA, 1972, pp. 937–944.

[41] P. Deolasee, A. Katkar, A. Panchbudhe, K. Ramamritham, and P. Shenoy, “Dissemination of Dynamic Data,” in Proc. ACM SIGMOD Conf., 2001, p. 599.

[42] D. J. DeWitt, D. Maier, P. Futtersack, and F. Velez, “A Study of Three Alternative Workstation-Server Architectures for Object-Oriented Database Systems,” in VLDB 1990, 1990, pp. 107–121.

[43] G. Diviney, “An Introduction to Short-Range Wireless Data Communications,” in Embedded Systems Conference, San Francisco, April 2003.

[44] W. Effelsberg and T. Haerder, “Principles of Database Buffer Management,” ACM TODS, vol. 9, no. 4, pp. 560–595, 1984.

[45] L. D. Fife and L. Gruenwald, “Research Issues for Data Communication in Mobile Ad-Hoc Network Database Systems,” ACM SIGMOD Record, vol. 32, no. 2, pp. 42–47, 2003.

[46] M. Engels, Wireless OFDM Systems: How to Make Them Work? Kluwer Academic Publishers, 2002.

[47] S. Evans, “Last Mile Technologies,” 2000. [Online]. Available: http://www.telsyte.com.au/feature/last mile.htm

[48] C. Faloutsos and S. Christodoulakis, “Signature Files: An Access Method for Documents and its Analytical Performance Evaluation,” ACM Trans. Inf. Syst., vol. 2, no. 4, pp. 267–288, 1984.

[49] A. A. Farrag and M. T. Özsu, “Using Semantic Knowledge of Transactions to Increase Concurrency,” ACM TODS, vol. 14, no. 4, pp. 503–525, 1989.

[50] A. Fekete, D. Liarokapis, E. O’Neil, P. O’Neil, and D. Shasha, “Making Snapshot Isolation Serializable,” 2004, accepted for publication in ACM TODS.

[51] M. J. Franklin and S. B. Zdonik, “Dissemination-Based Information Systems,” IEEE Bulletin of the Technical Committee on Data Engineering, vol. 19, no. 3, pp. 20–30, September 1996.

[52] H. Garcia-Molina, “Using Semantic Knowledge for Transaction Processing in a Distributed Database,” ACM TODS, vol. 8, no. 2, pp. 186–213, 1983.

[53] S. Ghemawat, “The Modified Object Buffer: A Storage Management Technique for Object-Oriented Databases,” MIT Laboratory for Computer Science, Cambridge, MA, Tech. Rep. MIT/LCS/TR-666, September 1995.

[54] J. D. Gibson, Ed., The Mobile Communications Handbook, 2nd ed. IEEE Press, 1999.

[55] J. Gray and G. Graefe, “The Five-Minute Rule Ten Years Later, and Other Computer Storage Rules of Thumb,” SIGMOD Record, vol. 26, no. 4, pp. 63–68, 1997.

[56] Groningen Growth and Development Centre, “60-Industry Database,” 2003. [Online]. Available: http://www.ggdc.net/dseries/60-industry.shtml

[57] PostgreSQL Global Development Group, PostgreSQL: The Most Advanced Open Source Database System in the World, 2004. [Online]. Available: http://www.postgresql.org

[58] R. E. Gruber, “Optimism vs. Locking: A Study of Concurrency Control for Client-Server Object-Oriented Databases,” MIT Laboratory for Computer Science, Cambridge, MA, Tech. Rep. MIT/LCS/TR-708, 1997.

[59] A. Gurtov, “Efficient Transport in 2.5G/3G Wireless Wide Area Networks,” Ph.D. dissertation, University of Helsinki, 2002.

[60] S. Hameed and N. H. Vaidya, “Log-Time Algorithms for Scheduling Single and Multiple Channel Data Broadcast,” in MobiCom 1997, 1997, pp. 90–99.

[61] R. C. Hansdah and L. M. Patnaik, “Update Serializability in Locking,” in International Conference on Database Theory, 1986, pp. 171–185.

[62] T. Härder, “Observations on Optimistic Concurrency Control Schemes,” Information Systems, vol. 9, no. 2, pp. 111–120, 1984.

[63] J. R. Haritsa, M. J. Carey, and M. Livny, “On Being Optimistic about Real-Time Constraints,” in ACM PODS, 1990, pp. 331–343.

[64] T. Henderson, “Networking over Next-Generation Satellite Systems,” Ph.D. dissertation,

University of California at Berkeley, 1999.

[65] T. Henderson and R. Katz, “Transport Protocols for Internet-compatible Satellite Networks,”

IEEE Journal on Selected Areas of Communications, vol. 17, no. 2, pp. 345–359, 1999.

[66] M. Herlihy and W. E. Weihl, “Hybrid Concurrency Control for Abstract Data Types,” JCSS,

vol. 43, no. 1, pp. 25–61, 1991.

[67] G. Herman, G. Gopal, K. Lee, and A. Weinrib, “The Datacycle Architecture for Very High

Throughput Database Systems,” in Proc. ACM SIGMOD Conf., 1987, pp. 97–103.

[68] W. W. Hsu, A. J. Smith, and H. C. Young, “Characteristics of Production Database Workloads

and the TPC Benchmarks,” IBM Systems Journal, vol. 40, no. 3, pp. 781–802, 2001.

[69] Q. Hu, L. D. L., and W.-C. Lee, “A Comparison of Indexing Methods for Data Broadcast

on the Air,” in Proceedings of the 12th International Conference on Information Networking

(ICOIN-12), 1998, pp. 656–659.

[70] Hughes Network Systems, DirecPC Home Page, January 2002. [Online]. Available:

http://www.direcpc.com/

[71] ——, DIRECWAY Home Page, July 2004. [Online]. Available: http://www.direcway.com/

[72] IEEE 802 LAN/MAN Standards Committee, “The Offical The IEEE 802.16 Working Group

on Broadband Wireless Access Standards .” [Online]. Available: http://www.ieee802.org/16

[73] T. Imielinski and B. R. Badrinath, “Data Management for Mobile Computing,” SIGMOD

Record, vol. 22, no. 1, pp. 34–39, 1993.

[74] ——, “Mobile Wireless Computing: Challenges in Data Management,” CACM, vol. 37,

no. 10, pp. 18–28, 1994.

[75] T. Imielinski, S. Viswanathan, and B. R. Badrinath, “Energy Efficient Indexing on Air,” in

Proc. ACM SIGMOD Conf., 1994, pp. 25–36.

[76] ——, “Power Efficient Filtering of Data on Air,” in EDBT 1994, 1994, pp. 245–258.

[77] ——, “Data on Air: Organization and Access,” IEEE Transactions on Knowledge and Data

Engineering, vol. 9, no. 3, pp. 353–372, 1997.

[78] ——, “Scheduling Data Broadcast in Asymmetric Communication Environments,” IEEE

Transactions on Knowledge and Data Engineering, vol. 9, no. 3, pp. 353–372, 1997.

[79] StarBand Communications, Inc., “Starband Home Page,” 2004. [Online]. Available: http://www.starband.com/

[80] Infrared Data Association — IrDA, “Serial Infrared Physical Layer Specification, ver. 1.4,”

May 2001. [Online]. Available: http://www.irda.org

[81] ——, “The Official IrDA Website,” 2004. [Online]. Available: http://www.irda.org

[82] K. Jacobs, “Concurrency Control: Transaction Isolation and Serializability in SQL92 and

Oracle7,” Oracle Corporation, Oracle White Paper Part No. A33745, July 1995.

[83] H. S. Jeon and S. H. Noh, “A Database Disk Buffer Management Algorithm Based on

Prefetching,” in Proceedings of ACM CIKM, Bethesda, Maryland, USA, 1998, pp. 167–174.

[84] D. G. Jeong and W. S. Jeon, “CDMA/TDD System for Wireless Multimedia Services with

Traffic Unbalance between Uplink and Downlink,” IEEE Journal on Selected Areas in Com-

munications, vol. 17, no. 5, pp. 939–946, 1999.

[85] T. Johnson and D. Shasha, “2Q: A Low Overhead High Performance Buffer Management

Replacement Algorithm,” in VLDB 1994, 1994, pp. 439–450.

[86] C. E. Jones, K. M. Sivalingam, P. Agrawal, and J.-C. Chen, “A Survey of Energy Efficient

Network Protocols for Wireless Networks,” Wireless Networks, vol. 7, no. 4, pp. 343–358,

2001.

[87] S. K. Joo and T. C. Wan, “Incorporation of QoS and Mitigated TCP/IP over Satellite Links,”

in Proc. 1st Asian Int’l Mobile Computing Conference (AMOC 2000), 2000.

[88] K. L. Wu, P. S. Yu, and C. Pu, “Divergence Control for Epsilon Serializability,” in ICDE, Febru-

ary 1992, pp. 506–515.

[89] S. Khanna and V. Liberatore, “On Broadcast Disk Paging,” in ACM STOC 1998, 1998, pp.

634–643.

[90] ——, “On Broadcast Disk Paging,” SIAM Journal on Computing, vol. 29, no. 5, pp. 1683–

1702, 2000.

[91] S.-W. Kim and H.-S. Won, “Batch-construction of B+-trees,” in Proceedings of the 2001

ACM symposium on Applied Computing, 2001, pp. 231–235.

[92] L. Kleinrock and F. Tobagi, “Packet Switching in Radio Channels: Part 1 — Carrier Sense

Multiple-access Models and their Throughput-delay Characteristics,” IEEE Trans. Com-

muni., vol. 23, no. 12, pp. 1400–1416, 1975.

[93] D. E. Knuth, The Art of Computer Programming: Sorting and Searching, 2nd ed. Addison-

Wesley, 1998, vol. 3.

[94] R. Kravets, K. Schwan, and K. Calvert, “Power-aware Communication for Mobile Comput-

ers,” in International Workshop on Mobile Multimedia Communications 1999, 1999.

[95] N. Krishnakumar and A. J. Bernstein, “High Throughput Escrow Algorithms for Replicated

Databases,” in VLDB 1992, 1992, pp. 175–186.

[96] H. T. Kung and J. T. Robinson, “On Optimistic Methods for Concurrency Control,” ACM

TODS, vol. 6, no. 2, pp. 213–226, 1981.

[97] E. R. Lassettre, “Olympic Records for Data at the 1998 Nagano Games,” in SIGMOD 1998,

L. M. Haas and A. Tiwary, Eds., 1998, p. 537.

[98] D. Lee, J. Choi, J. H. Kim, S. H. Noh, S. L. Min, Y. Cho, and C. S. Kim, “LRFU: A Spectrum

of Policies that Subsumes the Least Recently Used and Least Frequently Used Policies,”

IEEE Transactions on Computers, vol. 50, no. 12, pp. 1352–1361, 2001.

[99] D. Lee, J. Choi, J. H. Kim, S. H. Noh, S. L. Min, Y. Cho, and C. S. Kim, “On the Existence of

a Spectrum of Policies that subsumes the Least Recently Used (LRU) and Least Frequently

Used (LFU) Policies,” in ACM SIGMETRICS 1999, 1999, pp. 134–143.

[100] S. Y. Lee, M. C. Yang, and J. W. Chen, “Signature File as a Spatial Filter for Iconic Image

Database,” pp. 373–397, 1992.

[101] V. Lee, S. H. Son, and K. Lam, “On the Performance of Transaction Processing in Broadcast

Environments,” in MDA 1999, 1999, pp. 61–70.

[102] V. C. S. Lee and K. Lam, “Optimistic Concurrency Control in Broadcast Environments:

Looking Forward at the Server and Backward at the Clients,” in MDA 1999, 1999, pp. 97–

106.

[103] W. C. Lee and D. L. Lee, “Using Signature Techniques for Information Filtering in Wireless

and Mobile Environments,” DPDB, vol. 4, no. 3, pp. 205–227, 1996.

[104] S. Lu, A. Bernstein, and P. Lewis, “Correct Execution of Transactions at Different Isolation

Levels,” IEEE TKDE, vol. 16, no. 9, pp. 1070–1081, 2004.

[105] R. Ludwig, A. Gurtov, and F. Khafizov, “TCP over Second (2.5G)

and Third (3G) Generation Wireless Networks,” 2003. [Online]. Available:

http://www.ietf.org/rfc/rfc3481.txt?number=3481

[106] J. Martin, Communications Satellite Systems. Prentice Hall, 1978.

[107] Maxtor Corporation, “Maxtor Atlas 10K III - Product Manual,” 2002. [Online].

Available: http://www.maxtor.com/_files/maxtor/en_us/documentation/manuals/atlas10k_iii_manual.pdf

[108] Microsoft Corporation, “DirectBand Network. Microsoft Smart Personal Objects Technol-

ogy (SPOT),” 2005. [Online]. Available: http://www.microsoft.com/resources/spot/

[109] C. Mohan, H. Pirahesh, and R. Lorie, “Efficient and Flexible Methods for Transient Version-

ing of Records to Avoid Locking by Read-only Transactions,” in Proc. ACM SIGMOD Conf.,

1992, pp. 124–133.

[110] E. Mok, H. V. Leong, and A. Si, “Transaction Processing in an Asymmetric Mobile Environ-

ment,” in MDA, 1999, pp. 71–81.

[111] T. Nakajima, “Commutativity Based Concurrency Control and Recovery for Multiversion

Objects,” in International Workshop on Distributed Object Management, 1992, pp. 231–

247.

[112] J. H. Oh, K. A. Hua, and K. Prabhakara, “A New Broadcasting Technique for an Adaptive

Hybrid Data Delivery in Wireless Mobile Network Environment,” in Proc. of 19th IEEE

International Performance, Computing and Communications Conference, 2000, pp. 361–

367.

[113] J. H. Oh, K. A. Hua, and K. Vu, “An Adaptive Hybrid Technique for Video Multicast,”

in IEEE International Conference on Computer Communications and Networks, 1998, pp.

227–234.

[114] B. Oki, M. Pfluegl, A. Siegel, and D. Skeen, “The Information Bus: An Architecture for

Extensible Distributed Systems,” in 14th ACM Symposium on Operating System Principles,

Asheville, NC, 1993.

[115] E. J. O’Neil, P. E. O’Neil, and G. Weikum, “The LRU-K Page Replacement Algorithm for

Database Disk Buffering,” in Proc. ACM SIGMOD Conf., 1993, pp. 297–306.

[116] P. E. O’Neil, “The Escrow Transactional Method,” ACM TODS, vol. 11, no. 4, pp. 405–430,

1986.

[117] Oracle Corporation, “Concepts: 10g Release 1,” Oracle 10g Documentation, Part No.

B10743-01, December 2003.

[118] J. O’Toole and L. Shrira, “Opportunistic Log: Efficient Installation Reads in a Reliable Stor-

age Server,” in Operating Systems Design and Implementation, 1994, pp. 39–48.

[119] M. Palmer and S. B. Zdonik, “Fido: A Cache That Learns to Fetch,” in VLDB 1991, 1991,

pp. 255–264.

[120] C. Papadimitriou, The Theory of Database Concurrency Control. Computer Science Press,

1986.

[121] S. H. Phatak and B. R. Badrinath, “Multiversion Reconciliation for Mobile Databases,” in

ICDE 1999, 1999, pp. 582–589.

[122] E. Pitoura and B. Bhargava, “Maintaining Consistency of Data in Mobile Distributed Envi-

ronments,” in ICDCS 1995, 1995, pp. 404–413.

[123] E. Pitoura and P. Chrysanthis, “Exploiting Versions for Handling Updates in Broadcasting

Disks,” in VLDB, 1999, pp. 114–125.

[124] ——, “Scalable Processing of Read-Only Transactions in Broadcast Push,” in ICDCS, 1999,

pp. 432–439.

[125] E. Pitoura and G. Samaras, Data Management for Mobile Computing. Kluwer Academic

Publishers, 1998, vol. 10.

[126] N. Preguic¸a, J. L. Martins, M. Cunha, and H. Domingos, “Reservations for Conflict Avoid-

ance in a Mobile Database System,” in MobiSys 2003, 2003, pp. 43–56.

[127] N. M. Preguic¸a, C. Baquero, F. Moura, J. L. Martins, R. Oliveira, H. J. L. Domingos, J. O.

Pereira, and S. Duarte, “Mobile Transaction Management in Mobisnap,” in ADBIS-DASFAA,

2000, pp. 379–386.

[128] J. G. Proakis, Digital Communications, 4th ed. McGraw Hill, 2000.

[129] M. B. Pursley, “The Role of Spread Spectrum in Packet Radio Networks,” Proceedings of the

IEEE, vol. 75, no. 1, pp. 116–134, 1987.

[130] F. Rabitti and P. Zezula, “A Dynamic Signature Technique for Multimedia Databases,” in Proceedings of the 13th Annual International ACM SIGIR Conference on Research and De-

velopment in Information Retrieval, 1990, pp. 193–210.

[131] E. Rahm and A. Thomasian, “A New Distributed Optimistic Concurrency Control Method

and a Comparison of its Performance with Two-Phase Locking,” in Proceedings of the 10th

International Conference on Distributed Computing Systems, 1990, pp. 294–301.

[132] T. S. Rappaport, Wireless Communications: Principles and Practice, 2nd ed. Prentice-Hall,

Inc., 2002.

[133] I. Rubin, “Access-control Disciplines for Multiaccess Communications Channels: Reserva-

tion and TDMA Schemes,” IEEE Trans. Inform. Theory, vol. 25, no. 5, pp. 516–526, 1979.

[134] P. Rysavy, “MMDS Struggles to Find a Foothold,” Network Computing, 2001. [Online].

Available: http://www.networkcomputing.com/1222/1222f3.html

[135] M. Sainsbury, “Mobiles on the Move,” The Australian, 2004. [Online]. Available:

http://www.theaustralian.news.com.au/printpage/0,5942,8859609,00.html

[136] H. Schwetman, CSIM Users Guide, November 2002. [Online]. Available:

http://www.mesquite.com/htmls/guides.htm

[137] A. Seifert and M. H. Scholl, “Processing Read-only Transactions in Hybrid Data Delivery

Environments with Consistency and Currency Guarantees,” University of Konstanz, Tech.

Rep. 163, December 2001.

[138] ——, “Processing Read-only Transactions in Hybrid Data Delivery Environments with Con-

sistency and Currency Guarantees,” MONET, vol. 8, no. 4, pp. 327–342, 2003.

[139] J. Shanmugasundaram, A. Nithrakasyap, R. Sivasankaran, and K. Ramamritham, “Efficient

Concurrency Control for Broadcast Environments,” in Proc. ACM SIGMOD Conf., 1999, pp.

85–96.

[140] M. Shapiro, A. I. T. Rowstron, and A.-M. Kermarrec, “Application-independent Reconcil-

iation for Nomadic Applications,” in ACM SIGOPS European Workshop 2000, 2000, pp.

1–6.

[141] D. Shasha, F. Llirbat, E. Simon, and P. Valduriez, “Transaction Chopping: Algorithms and

Performance Studies,” ACM TODS, vol. 20, no. 3, pp. 325–363, 1995.

[142] N. Shivakumar and S. Venkatasubramanian, “Energy Efficient Indexing for Information Dis-

semination in Wireless Systems,” MONET, vol. 1, no. 4, pp. 433–446, 1996.

[143] S. Singh, M. Woo, and C. S. Raghavendra, “Power-Aware Routing in Mobile Ad Hoc Net-

works,” in Mobile Computing and Networking, 1998, pp. 181–190.

[144] A. J. Smith, “Disk Cache-Miss Ratio Analysis Design Considerations,” ACM TOCS, vol. 3,

no. 2, pp. 161–203, 1985.

[145] R. Srinivasan, C. Liang, and K. Ramamritham, “Maintaining Temporal Coherency of Virtual

Data Warehouses,” in RTSS 1998, 1998, pp. 60–70.

[146] K. Stathatos, “Air-Caching: Adaptive Hybrid Data Delivery,” Ph.D. dissertation, University

of Maryland, College Park, Maryland, 1999.

[147] K. Stathatos, N. Roussopoulos, and J. S. Baras, “Adaptive Data Broadcast in Hybrid Net-

works,” in VLDB 1997, 1997, pp. 326–335.

[148] C. J. Su and L. Tassiulas, “Broadcast Scheduling for Information Distribution,” in Infocom

1997, 1997, pp. 109–117.

[149] L. Tassiulas and C. J. Su, “Optimal Memory Management Strategies for a Mobile User in

a Broadcast Data Delivery System,” IEEE Journal on Selected Areas in Communications,

vol. 15, no. 7, pp. 1226–1238, 1997.

[150] D. B. Terry, M. M. Theimer, K. Petersen, A. J. Demers, M. J. Spreitzer, and C. Hauser,

“Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System,” in

Proceedings 15th Symposium on Operating Systems Principles (SOSP-15), 1995, pp. 172–

183.

[151] A. Tomkins, R. H. Patterson, and G. Gibson, “Informed Multiprocess Prefetching and

Caching,” in ACM SIGMETRICS 1997, 1997, pp. 100–114.

[152] K. L. Tripp, “SQL Server 2005 Beta 2 Snapshot Isolation,” 2004. [Online]. Available:

http://www.microsoft.com/technet/prodtechnol/sql/2005/SQL05B.mspx

[153] N. H. Vaidya and S. Hameed, “Scheduling Data Broadcast in Asymmetric Communication

Environments,” ACM Wireless Networks, vol. 5, no. 3, pp. 171–182, 1999.

[154] M. A. Viredaz, L. S. Brakmo, and W. R. Hamburgen, “Energy Management on Handheld

Devices,” Queue, vol. 1, no. 7, pp. 44–52, 2003.

[155] S. R. Viswanathan, “Publishing in Wireless and Wireline Environments,” Ph.D. dissertation,

Rutgers University, 1996.

[156] Vocal Technologies, Ltd., “EDGE — Enhanced Data Rate GSM,” 2001. [Online]. Available:

http://www.vocal.com/data_sheets/full/edge.pdf

[157] G. D. Walborn and P. K. Chrysanthis, “Supporting Semantics-Based Transaction Processing

in Mobile Database Applications,” in Symposium on Reliable Distributed Systems, 1995, pp.

31–40.

[158] W. E. Weihl, “Data-dependent Concurrency Control and Recovery,” in PODC 1983, 1983,

pp. 63–75.

[159] ——, “Distributed Version Management for Read-only Actions,” SE, vol. 13, no. 1, pp. 55–

64, January 1987.

[160] ——, “Commutativity-Based Concurrency Control for Abstract Data Types,” IEEE Trans-

actions on Computers, vol. 37, no. 12, pp. 1488–1505, 1988.

[161] G. Weikum and G. Vossen, Transactional Information Systems: Theory, Algorithms, and

Practice of Concurrency Control and Recovery. Morgan Kaufmann, 2001.

[162] H. Garcia-Molina and G. Wiederhold, “Read-only Transactions in a Distributed Database,” ACM TODS, vol. 7,

no. 2, pp. 209–234, 1982.

[163] O. Wolfson, A. P. Sistla, S. Chamberlain, and Y. Yesha, “Updating and Querying Databases

that Track Mobile Units,” Distributed and Parallel Databases, vol. 7, no. 3, pp. 257–287,

1999.

[164] J. W. Wong, “Broadcast Delivery,” Proceedings of the IEEE, vol. 76, no. 12, pp. 1566–1577, 1988.

[165] World Airline Entertainment Association Internet Working Group (IWG), “Ma-

trix of Service Delivery Options, Version 1.0,” 2001. [Online]. Available:

www.waea.org/tech/techdocs/off-board matrix v10.doc

[166] J. Xu, Q. Hu, D. L. Lee, and W.-C. Lee, “SAIU: An Efficient Cache Replacement Policy for

Wireless On-demand Broadcasts,” in ACM CIKM 2000, 2000, pp. 46–53.

[167] J. Xu, W.-C. Lee, and X. Tang, “Exponential Index: A Parameterized Distributed Indexing

Scheme for Data on Air,” in MobiSys 2004, 2004, pp. 153–164.

[168] G. K. Zipf, Human Behavior and the Principle of Least Effort: An Introduction to Human Ecol-

ogy. Addison-Wesley, 1949.