
Scalable Distributed Data Structures for Database Management

Academic Dissertation (Academisch Proefschrift)

to obtain the degree of doctor at the Universiteit van Amsterdam, on the authority of the Rector Magnificus prof. dr. J. J. M. Franse, before a committee appointed by the board for doctorates, to be defended in public in the Aula of the University on Thursday 14 December 2000, at 11:00,

by Jonas S Karlsson, born in Enköping, Sweden

Promotor: Prof. Dr. M. L. Kersten
Faculteit: Faculteit der Natuurwetenschappen, Wiskunde en Informatica

The research reported in this thesis has been partially carried out at the University of Linköping, Sweden, within the Engineering Database Lab, a research group at the Department of Computer and Information Science at Linköping Institute of Technology.

The research reported in this thesis has been partially carried out at CWI, the Dutch national research laboratory for mathematics and computer science, within the theme Data Mining and Knowledge Discovery, a subdivision of the research cluster Information Systems.

The research reported in this thesis has been carried out under the auspices of SIKS, the Dutch Graduate School for Information and Knowledge Systems. SIKS Dissertation Series No. 2000-11.

ISBN 90 6196 498 9

Contents

I Scalable Distributed Data Structures

1 Preliminaries
  1.1 Birthground of SDDSs
  1.2 SDDSs
  1.3 Requirements from SDDSs
  1.4 Data Structures - Background
    1.4.1 Retrieval Methods
    1.4.2 Reasonable Properties
    1.4.3 Basic Data Organization
    1.4.4 Memory Management/Heaps
    1.4.5 Linked Lists
    1.4.6 Chained/Closed Hashing
    1.4.7 Trees
    1.4.8 Signature Files
  1.5 Roundup
  1.6 LH* (1 dimensional data)
    1.6.1 LH* Addressing Scheme
    1.6.2 LH* File Expansion
    1.6.3 Conclusion
  1.7 Orthogonal Aspects
    1.7.1 Performance
    1.7.2 Dimensions
    1.7.3 Overhead
    1.7.4 Distribution and Parallelism
    1.7.5 Availability

2 The LH*LH Algorithm
  2.1 Introduction
  2.2 The Server
    2.2.1 The LH Manager
    2.2.2 LH* Partitioning of an LH File
    2.2.3 Concurrent Request Processing and Splitting
    2.2.4 Shipping


  2.3 Notes on LH*LH Communications
    2.3.1 Communication Patterns
  2.4 LH*LH Implementation
    2.4.1 The System Initialization
    2.4.2 The Data Client
    2.4.3 The Server
    2.4.4 Server Mapping
    2.4.5 Summary and Future Work
    2.4.6 Host for Scientific Data
  2.5 Hardware Architecture
    2.5.1 Communication
    2.5.2 Measure Suite
  2.6 Performance Evaluation
    2.6.1 Scalability
    2.6.2 Efficiency of Concurrent Splitting
  2.7 Curiosity
  2.8 Conclusion

3 SDDS for High-Performance Spatial Access
  3.1 Introduction
  3.2 hQT* Overview
    3.2.1 Records
    3.2.2 Pseudokey Construction
    3.2.3 Bucket Numbering
    3.2.4 Addressing
    3.2.5 File Growth
  3.3 Distribution in hQT*
    3.3.1 Distribution (ForwardBuckets)
    3.3.2 Distributed Point Queries
    3.3.3 Distributed Region Queries
    3.3.4 IAM Policies
  3.4 Server Splitting
    3.4.1 hQT* Splitting
    3.4.2 Dissection Splitting Algorithm
  3.5 Measurements
    3.5.1 Efficiency of IAM Policies
    3.5.2 Server Load Distribution
    3.5.3 Discussion
  3.6 Conclusions

4 Ω-storage: Multi-Attribute Storage
  4.1 Introduction
  4.2 Related Work
  4.3 The Ω-storage

    4.3.1 Buckets and Branch nodes
    4.3.2 An Example
    4.3.3 Point Searching
    4.3.4 Splitting Strategy
  4.4 Performance evaluation and Tuning
    4.4.1 Bucket Size vs Pruning
    4.4.2 Insert costs
    4.4.3 Search cost for a growing data set
    4.4.4 Influence of Number of Attributes
    4.4.5 Comparison with kd-tree
  4.5 Exploration of the Ω-tree design space
    4.5.1 Branches
    4.5.2 Dynamic hash-function
    4.5.3 Explored variants of Ω-trees
    4.5.4 Implementation Notes
  4.6 Conclusions

II Applications of SDDSs

5 Database Systems
  5.1 The Need for High Performance Databases
  5.2 Conventional Databases
  5.3 Distributed Databases
  5.4 Federated Databases
  5.5 Multidatabases
  5.6 Data Servers
  5.7 Parallel Data Servers
  5.8 Database Machines
  5.9 Overview of Some Data Servers
  5.10 DB History
  5.11 Conclusions
  5.12 Properties of Structures for Servers
    5.12.1 The Problem
    5.12.2 Scalability
    5.12.3 Distribution
    5.12.4 Availability
    5.12.5 Conclusions

6 Scalable Distributed Storage Manager
  6.1 Seamless SDDS integration in an Extensible DBMS
  6.2 Introduction
  6.3 Background
    6.3.1 Scalable Distributed Data Structures

    6.3.2 Monet
  6.4 SDDS within an Extensible Database System
    6.4.1 SDDS requirements on a DBMS
    6.4.2 Resource Management
    6.4.3 SDDS Administration
    6.4.4 Algebraic Operations
  6.5 Implementation and Performance Study
    6.5.1 Optimal Size of a Distributed Partition
    6.5.2 Overhead added by SDDSs
    6.5.3 Performance Scalability
    6.5.4 Discussion
  6.6 Summary

7 Summary & Future Issues
  7.1 Summary
  7.2 Extensions to this thesis
  7.3 Future work

List of Figures

1.1 LH* File Expansion Scheme

2.1 The Data Server
2.2 The LH-structure
2.3 Pseudo-key usage by LH and LH*
2.4 Partitioning of an LH-file by LH* splitting
2.5 One node on the Parsytec machine
2.6 Static routing on a 64 nodes machine between two nodes
2.7 Allocation of servers and clients
2.8 Build time of the file for a varying number of clients
2.9 Global insert time measured at one client, varying the number of clients
2.10 Actual throughput with varying number of clients
2.11 Ideal and actual throughput with respect to the number of clients
2.12 Comparison between Static and Dynamic splitting strategy, one client
2.13 Comparison between Static and Dynamic splitting, with four clients
2.14 Efficiency of individual shipping
2.15 Efficiency of bulk shipping
2.16 Efficiency of the concurrent splitting
2.17 LH*LH client insert time scalability

3.1 The Record Structure
3.2 An offset space-filling curve, first 3 layers
3.3 Navigation in a) Quad-Tree b) hQT*
3.4 Left: hQT* file key space partitioning by 4 successive splits. Right: The equivalent quad-tree
3.5 Split Dissection Algorithm
3.6 a) Forward message count using different policies on servers 1, 2, 3 and 31. b) Only the 3 most efficient strategies
3.7 Split distribution over "time"


4.1 A bucket of an Ω-tree and its attributes
4.2 A "typical" Ω-marshaled tree
4.3 Bucket Split Algorithm
4.4 Varying a) search times b) insert times
4.5 Search time using a) 1, 2, 3, 8 attributes in Ω-tree b) details of 8 attributes
4.6 a) Search times b) 8, 16 attribute files c) 8 attribute file e) 16 attribute file
4.7 8 attribute file, standard deviation for pattern search a) the kd-tree and Ω-tree 10% compared with the 0-20% b) KD compared with Ω-tree with acceptance limit of 10% and 20%

5.1 Data and application servers

6.1 Local memory on one node, varying sizes of data
6.2 One node, distributed access, varying sizes
6.3 Constant sized relation 8 MBytes, varying number of nodes
6.4 Varying sizes, fixed number of nodes (8)
6.5 Scale up values; varying number of nodes, each storing 32 MB
6.6 Comparison between main-memory processing, and distributed memory

Preface

Research is increasingly focusing on using multicomputers [Cul94] [ÖV91] [Tan95]. Multicomputers are built from mass-produced PCs and workstations, often integrated with high bandwidth networks. Such infrastructures have emerged rapidly in the last decade, and many organizations typically have hundreds of networked machines with a large total amount of RAM, CPU and disk-storage resources.

Multicomputers provide a challenge and a promise to cope with the ever-increasing amount of information using new distributed data structures. Scalable Distributed Data Structures (SDDSs) form a class of data structures, first proposed in [LNS93], that allows for scalable distributed files that are efficient at searching and insertions. The approach taken sets virtually no upper limit on the number of nodes (computers) that participate in the effort. Multiple autonomous clients access data on the server-nodes using a local image to calculate where the data is stored. Their images might be outdated, but the client is updated when it incurs an addressing error.

This thesis contributes to the field of distributed data storage in the context of very large database systems. Even though several commercial database systems handle large volumes of data (>TBytes in the year 2000), the distribution of this data is still limited to a cluster of local machines. Today, with growing networking environments, more information is available and needs to be organized to facilitate fast access. Single computers' resources are commonly shared over the networks using centralized approaches that eventually cannot handle the load. Alternatively, computers can work together on one problem, abolishing centralized approaches. When computers jointly cooperate on storing/processing we refer to them as a multicomputer. Linear scale-up is an important performance target for multicomputers and parallel machines, to fully utilize the increased cost of the hardware and interconnects. The optimal objective for scale-up is that a doubling of the computer resources allows double the work(load) to be tackled in the same time. However, there are many orthogonal aspects that should scale up, such as storage, availability, processing, retrieval, and filtering. The research field of Scalable Distributed Data Structures has presented a number of solutions for storage, etc., but integration or actual implementations have been scarce. We present a number of novel scalable data structures, including experiences

from the first actual implementation of an SDDS on a parallel machine. Furthermore, we show the feasibility of readily integrating an SDDS into an extensible database system. The thesis emphasis is on hash-like structures and the practical experiences learned through simulated experiments and implementations on multi-computers. One common property of all the structures presented herein is that they employ bit-string addressing, allowing for low cost storage and navigation in the data structures.

Observing the growth of the amount of data being made available through networks, we are convinced that scalable distributed data structures will have implications on storage as well as processing in the internet and database community at the beginning of the new millennium.

The focus of the thesis is on novel scalable distributed data structures (SDDSs) for handling large amounts of data. The work is proven practical and does not engage in aspects of complexity theory. It deals with explorative research in the domain of scalable distributed data structures, but is not conclusive. The thesis is based on a number of publications, listed below together with the chapters in which they occur. This means that some parts may be duplicated, such as the introduction to SDDSs, keeping the continuity of the papers. The papers on LH*LH [KLR96] and hQT* [Kar98] deal with SDDSs, and [KK98] deals with an SDDS's integration into a DBMS. The Ω-storage [Kar00] is a novel main-memory structure, which can easily be extended to be an SDDS.

Thesis Overview

The thesis is organized as follows. Part I deals with data structures, scalable distributed data structures in particular. Part II describes usage scenarios where SDDSs are implemented and integrated into DBMSs.

In Part I, Chapter 1, we introduce the SDDS LH* and data structures in general, completing the chapter with a short overview of some SDDSs.

Chapter 2 presents the first SDDS implementation on a large parallel machine (SMD). The LH*LH structure developed is based on LH* and extends earlier work by employing LH internally for local bucket management. It identifies the importance of local buffer management for real-time aspects and discusses hard-won experiences using the Parsytec machine. The chapter is based on a publication at the EDBT'96 conference [KLR96] and the subsequent Licentiate Thesis [Kar97].

In Chapter 3, hQT*, a novel fully designed spatial data structure, is presented. It includes an effective local storage schema that readily integrates with the distributed schema, minimizing implementation efforts.

Part I of the thesis is concluded in Chapter 4 by introducing a novel main-memory tree structure for very large data sets: Ω-storage, a new multi-attribute self-organizing data storage structure. It provides an improvement over kd-trees and bit-organized tree-structures by being more resistant to skewed data input. It can easily be extended to an SDDS scenario by utilizing splitting and distribution methods from the hQT* structure.

Part II: Chapter 5 starts with an overview of high performance database systems, their requirements and their implications on data structures.

In Chapter 6, we outline the features of one system that integrates an SDDS into an extensible DBMS, using the DBMS's native extension interface. By implementing one such system we show the viability of the concept and point out possible extensions [KK98]. A vision of a new concept, Live Optimization, is introduced, which we believe is a promising way to simplify distributed query processing.

Chapter 7 concludes the thesis by summarizing contributions and by enumerating problems that need further attention.

Financial Support

The first part of my PhD work was supported by NUTEK (The Swedish National Board for Industrial and Technical Development) and CENIIT (The Center for Industrial Information Technology), and was performed during the years 1994 to 1996 at the Computer Science Department (IDA) at Linköping University, Sweden.

Since 1996 the project has been partially funded by the HPCN/IMPACT project at CWI, Amsterdam, The Netherlands.

Acknowledgment

Without the help of others, this thesis would never exist. I'm grateful for the support from all the people mentioned directly and indirectly in the following: Professor Martin Kersten - for encouragement and the possibility to continue the exploration of SDDSs, and for many visionary discussions and brain-storming sessions; Professor Arno Siebes - for being the best and least bossy boss I had, and for his always joyful comments; Professor Tore Risch - for my first introduction to research and to database systems implementations, also for the enthusiastic support during the writing of my Licentiate Thesis, and for providing an exciting and open discussion atmosphere in the EDSLAB group; Arjan Pellenkoft - for being a great guy to share an office with, but not least for his stunning ability to pinpoint "what-to-fix" with papers; Peter Boncz - for his friendliness and discussions about the Monet Database System; Florian Waas - for his good ability to stand my humor, but also for interesting discussions on problem solving; Wilko Quak - for give-and-take in the area of "strange" data structures; Witold Litwin - who introduced me to the area of SDDSs through LH*, and for the fruitful cooperation on the LH*LH paper.

Finally, I would like to thank all my current colleagues at CWI as well as former colleagues at EDSLAB for the working environment. Special thanks goes to Thomas Padrone McCarthy, who thanked me in his thesis for thanking him in mine. By the way, he is great to discuss obscure implementations with too!

At PELAB (Linköping), I also wish to thank the former members Niclas Andersson and Lars Viklund for their help on the Parsytec machine, and Henrik Nilsson for our extended discussions on computer "religious" matters in computer science.

At SVL, IBM, I'd like to thank Jay Pederson for doing a final read-through and giving constructive feedback.

In an extended final thanks, I want to mention my family, both in Norway and Sweden; I have not seen you as much as we would have liked, but you have always been in my mind.

And I have to mention my Amsterdam friends, who are the best you could wish for: Tobias "MovieBuff", Frida, Scott & Rachel, Thian & Robert, Maja & Henrik Sydost-Berg, Peter & Cecilia, and Martita for an

interesting first time in Amsterdam, and last but certainly not least Kee Jon Kim.

Thanks,

Jonas S Karlsson
Amsterdam, January 2000

Timeline

Once upon a time there was an MSc student. His name was Jonas S. Karlsson. One day he went out to look for some teaching jobs at the University of Linköping. He saw a notice with a half-time technical programmer job in the database group. There were two contact persons. In the scary computer science department he walked past the doors looking at the pictures. He ended up in front of Tore Risch's door. He was friendly and hired Jonas. It was agreed that he would do his Master Thesis on Transaction Logging and Recovery for the AMOS (WS-IRIS) main memory database system. Eventually, he finished his Master Thesis [Kar94], and there was joy.

He continued for a while as a technical programmer on the AMOS project and teaching crash courses in programming using Lisp. Witold Litwin visited Linköping and introduced him to the idea of SDDSs. Unknowingly, Jonas was being trapped! He was going to pursue a PhD. The rest of this book is the final result. But back to the story. At IDA (the computer science institute in Linköping) the PELAB group had acquired a Parsytec parallel computer. Since it would be nice if somebody used it for something, Jonas implemented a version of Litwin's LH* named LH*LH. In an academic environment, however, it is not enough to just do things; it also has to be published at conferences. This is not as bad as it might sound, since conferences typically occur in countries far away, in combination with vacation. This was something Jonas had not imagined at the time. Tore sent Jonas to Paris to write the publication together with Witold. In total there were two hectic weeks writing the paper. Litwin scrutinized the evolving paper. The paper [KLR96] was sent away to a conference. There was a summer. Staffan Larsson, a colleague, had a publication at the Very Large Databases conference in Zurich and Jonas was sent along. Switzerland is expensive; having a fast food snack was like going to a restaurant in Sweden, in price at least. The conference was enjoyed. Most of the talks were very interesting, and gave high inspiration. During the conference dinner Tore seated us at the table of Martin Kersten and other Dutch researchers. Tore whispered that Martin had some interest in Jonas' work. There were some interesting discussions, and a loose invitation for a visit. After the conference Jonas went by train to Rome and flew home. Now he had incentive to do some research.

The conference was Extending Database Technology and it was held in


Avignon, France. Jonas was presenting his paper on LH*LH. At the conference Jonas met Wilko Quak and Peter Boncz, his colleagues to be (all unknowingly).

After the conference Jonas met up with Tore Risch and they drove to Schloss Dagstuhl(1). They had been invited to participate in a seminar on Performance Enhancement in Object Bases. On one of the nights, while intoxication took place in the wine cellar, Guido Moerkotte and Bernhard Seeger inspired Jonas to do further work on scalable distributed data structures. Eventually, this resulted in both the hQT* paper and the Ω-storage structure.

In the evening when Martin Kersten had arrived, again in the wine cellar, he involved Jonas in a discussion of how to implement efficient database systems; Peter was backing up Martin. Martin proposed that Jonas come and stay for some weeks. Rumor has it that this is a non-forgotten event. Jonas was overwhelmed; however, it could take some time, since he wanted to keep the promise (to himself) to finish the Licentiate Thesis.

Well at home, after an Easter in Heidelberg, he set out to finish the work. Not much later an official invitation arrived asking for the dates. First, it was planned to be a visit just before X-mas. Tore, knowing that writing a thesis does take time, suggested that the visit be made during the spring. Plans were made for mid-February and a period of 4 months.

In Amsterdam Jonas was provided with a studio-apartment at the distinguished Prinsengracht. The house belonged to Martita Wiessing, an "elderly artistic lady" according to the description Jonas was told. The house is a typical Amsterdam house. Martita runs a gallery, or maybe the other way around. It was a very pleasant atmosphere with interesting people at late hours.

On a spring trip back to Sweden, the Licentiate Thesis was finally presented. At this point Jonas was not sure if he wanted to go on with a PhD. One important reason is that a PhD was considered "too high an education" in the Swedish industry; a licentiate was just appropriate. July presented Jonas with the choice of staying and finishing the PhD studies in Amsterdam. So far the life in Amsterdam had provided the time to think over the situation, and it yielded the decision that this was a once-in-a-lifetime chance. Furthermore, for working abroad a PhD opens some doors (and checkbooks)... So he signed up for another 2.5 years.

After a while Jonas moved to another central area, the Nieuwmarktbuurt. Binnen Bantammerstraat, to be more exact. For those who do not know it, this is the original "Chinatown" street in Amsterdam; however, now most of the businesses have moved to another street. Only 2 Chinese restaurants remain; otherwise the street has four bars, three of which serve some food, one Thai restaurant, one hairdresser, one violin-bow-builder, one wine store, one dentist, one fruit shop, and a Porsche shop. A later addition to the diversity of the street, just below my apartment, was a tattoo shop. All this in a hundred meter long street!

In the beginning the work was concentrated on showing the feasibility of adding an SDDS to the Monet database system. Implementation strategies of different intrusiveness were considered and one was selected. Experiments were performed and a paper was accepted at a parallel conference in the city of Las Vegas, a crazy city. A word of advice: do not stay there for longer than a week; the sound of these machines... Still with some money in the pocket he returned home. Work began on a spatial structure, inspired by the often occurring sudden rainfall distress. However, this one would not stop. At home in the sofa, Jonas read papers and sketched ideas for a 2-dimensional SDDS structure, later to be named hQT*. This structure ended up at a conference in Kobe, Japan (FODO). For Jonas, Japan was an overwhelmingly interesting experience, but exhausting. Well at home, plans started for the PhD Thesis, only to be intervened by another structure, Ω-storage. The structure was inspired by having a 24 processor machine with 48 GB of shared main memory. The paper was accepted at the Australian Database Conference, and presented in January 2000.

Now, reality has caught up with the story. You are hopefully reading an approved and printed PhD Thesis. It collects the published work on the matters that Jonas pursued in his research. Jonas hopes it provides interesting reading.

Jonas was last sighted in Silicon Valley. Someone provided a rumour that he'd been gobbled up by the giant IBM. It is being said that he's building itsy bitsy database systems for small devices like Palm/WinCE/embedded Linux.

The story will be continued, but this is outside the scope of this document ...

Anonymous

(1) At a workshop on Performance Enhancement in Object Bases, http://www.dag.uni-sb.de/DATA/Participants/9614.html

Part I

Scalable Distributed Data Structures

Chapter 1

Preliminaries

We start out by giving an introduction to SDDSs, followed by a description of LH*, which generated this track of data storage structures. We motivate their existence and point out some possible application areas; subsequently we describe common (basic) data structures (known to most readers), their behavior, and their properties in the perspective of SDDSs. This defines the terminology and allows us later to describe design choices more accurately.

1.1 Birthground of SDDSs

In traditional distributed file systems, in implementations like NFS or AFS, a file resides entirely at one specific site. This presents obvious limitations, not only on the size of the file, but also on the scalability of access performance. To overcome these limitations, distribution over multiple sites has been used. One example of such a scheme is round-robin [Cor88], where the records of a file are evenly distributed by rotating through the nodes as records are inserted. The hash-declustering method of [KTMO84] assigns records to nodes on the basis of a hashing function. The range-partitioning method of [DGG+86] divides key values into ranges, and different ranges are assigned to different nodes. A common aspect of these schemes is their static behavior, which means that the declustering criterion does not change over time. Hence, updating a directory or declustering function is not required. The price to pay is that the file cannot expand over more sites than initially allocated.

To overcome this limitation of static schemes, dynamic partitioning is used. The first such scheme was DLH [SPW90]. This scheme was designed for a shared-memory system. In DLH, the file is in RAM and the file parameters are cached in the local memory of each processor. The caches are refreshed selectively when addressing errors occur, and through atomic updates to all the local memories at some points. DLH appears impressively efficient for high insertion rates.
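To make the contrast with the dynamic schemes concrete, the sketch below renders the three static declustering schemes described at the start of this section as placement functions over a fixed set of nodes. The node count, hash function and key ranges are illustrative assumptions, not taken from the cited systems.

```python
# A minimal sketch of the three static declustering schemes. The fixed node
# count is exactly the limitation discussed above: none of these functions
# can place data on more sites than were initially allocated.

NUM_NODES = 4  # assumption: the initially allocated set of nodes

def round_robin_node(insert_sequence_number: int) -> int:
    """Round-robin: records rotate over the nodes in insertion order."""
    return insert_sequence_number % NUM_NODES

def hash_declustering_node(key: int) -> int:
    """Hash-declustering: a hash of the key picks the node."""
    return hash(key) % NUM_NODES

RANGES = [(0, 249), (250, 499), (500, 749), (750, 999)]  # assumed key ranges

def range_partitioning_node(key: int) -> int:
    """Range-partitioning: each node owns a contiguous range of key values."""
    for node, (low, high) in enumerate(RANGES):
        if low <= key <= high:
            return node
    raise ValueError("key outside the statically allocated ranges")
```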


1.2 SDDSs

SDDSs were proposed for distributing files in a networked multi-computer environment, hence without a shared memory. The first scheme was LH* [LNS93]. Distributed Dynamic Hashing (DDH) [Dev93] is another SDDS, based on Dynamic Hashing [Lar78]. The idea, with respect to LH*, is that DDH allows greater splitting autonomy by immediately splitting overflowing buckets. One drawback is that while LH* limits the number of forwardings to two(1) when the client makes an addressing error, DDH may use O(log2 N) forwardings, where N is the number of buckets in the DDH file.

[WBW94] extends LH* and DDH to control the load of a file more efficiently. The main idea is to manage several buckets of a file per server, while LH* and DDH have basically only one bucket per server. One also controls the server load, as opposed to the bucket load for LH*.

Both [KW94] and [LNS94] propose primary key ordered files. In [KW94] the access computations on the clients and servers use a distributed binary search tree, whereas the SDDSs in [LNS94], collectively termed RP*, use broadcast or distributed n-ary trees. It is shown that both kinds of SDDSs allow for much larger and faster files than the traditional ones.

(1) In theory, communication delays could trigger more forwardings [WBW94].

1.3 Requirements from SDDSs

SDDSs (Scalable Distributed Data Structures), such as the distributed variant of Linear Hashing, LH* [LNS96], and others [Dev93][WBW94][LNS94], open up new areas of storage capacity and data access. There are three requirements for an SDDS:

First, it should have no central directory to avoid hot-spots.

Second, each client should have some approximate image of how data is distributed. This image should be improved each time a client makes an addressing error.

Third, if the client has an outdated image, it is the responsibility of the SDDS to forward the data to the correct data server and to adapt the client's image.

SDDSs are good for distributed computing since they aim at minimizing the communication, which in turn minimizes the response time and enables more efficient use of the processor time.

In light of LH [Lit80] and LH* [LNS96] the following terms are used. The data sites, termed servers, can be used from any number of autonomous sites, termed clients. To avoid a hot-spot, there is no central directory for the addressing across the current structure of the file. Each client has its own image of this structure. An image can become outdated when the file expands. The client may then send a request to an incorrect server. The servers forward such requests, possibly in several steps, towards the correct address. The correct server appends to the reply a special message to the client, called an Image Adjustment Message (IAM). The client adjusts its image, avoiding a repetition of the error. A well-designed SDDS should make addressing errors occasional and forwards few, and should provide for the scalability of the access performance when the file grows. A typical SDDS scenario has many more clients than servers, and clients that are reasonably active, i.e., a hundred or more interactions in the lifetime of a client.
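The three requirements translate into a small amount of client and server logic. The sketch below is a schematic rendering of that interplay, not code from the thesis: the ownership test, the message format and the image being a plain server count are simplifying assumptions; the real address computation from an image is given in Section 1.6.

```python
# A schematic sketch of the forwarding-plus-IAM pattern: a client addresses a
# server from its possibly outdated image, a wrongly addressed server forwards
# the request, and the correct server's reply carries an Image Adjustment
# Message (IAM). The trivial "image = number of servers" is an assumption.

class Server:
    def __init__(self, sid, servers):
        self.sid, self.servers, self.store = sid, servers, {}

    def owns(self, key):
        return key % len(self.servers) == self.sid

    def insert(self, key, value):
        if not self.owns(key):                       # client image was outdated
            correct = key % len(self.servers)
            return self.servers[correct].insert(key, value)    # forward
        self.store[key] = value
        return {"ok": True, "iam": len(self.servers)}           # reply + IAM

class Client:
    def __init__(self, servers):
        self.servers = servers
        self.image = 1                                # starts with an old image

    def insert(self, key, value):
        guess = key % self.image                      # address from local image
        reply = self.servers[guess].insert(key, value)
        self.image = reply["iam"]                     # adjust image: the same
        return reply                                  # error is not repeated

servers = []
servers.extend(Server(i, servers) for i in range(4))
client = Client(servers)
client.insert(7, "x")   # misaddressed, forwarded once, IAM adjusts the image
client.insert(7, "y")   # now sent directly to the correct server
```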

1.4 Data Structures - Background

In this section we give a short overview of commonly used data structures for indexing. We start out by introducing desirable features of such data structures. In a distributed scenario using SDDSs these desired properties are of even higher importance. The workings of distributed data structures are not the main topic of this section; thus it can be skipped by the expert on data structures.

Data in DBMSs is organized using data structures, also referred to as access structures, access paths, accelerators, indices, or indexing structures. We identify three important properties of a "good" data structure. The first is that application accesses to the individual elements encoded in the data structure should be fast, i.e. insertion and retrieval should be efficient (access overhead). Secondly, the storage overhead, i.e. the extra storage space needed for organizing the data and improving the access speed, should be low. Third, the data structure should be able to handle the amount of data that is needed, i.e. it should be scalable: the structure should dynamically adapt to different storage sizes without deteriorating performance.

1.4.1 Retrieval Methods

Data is structured in records containing fields (attributes), e.g., bank account information. Some fields are the targets of retrieval, called search keys. The search method depends on the inherent search characteristics, which can be classified along the following lines:

Key retrieval (lookup)

Range retrieval (domain specific)

Approximate retrieval (sub-string, soundex) 24 4 CHAPTERCHAPTER 1. PRELIMINARIES

Predicate search (filtering)

Multi-dimensional searches (point, spatial, space, nearest)

Information retrieval (probabilistic)

Where some applications access data using one key, others may need to retrieve data in a certain range, which leads to range retrieval. Approximate search is another type of retrieval, often user specified, allowing for matching under a similarity measure. A special case of proximity is sub-string search, for example searching for addresses whose names contain the string "city". Soundex search allows one to search for names of people that sound like "John" (Jon, John, Jonny, Jonni, Johnnie, Johnny). Soundex searching is efficiently implemented by mapping the search key to a normalized (sound invariant) spelling representation.

Other kinds of retrieval might consider several fields simultaneously, often referred to as different attributes or dimensions. Examples of dimensions are spatial (x, y)-coordinates, or in a common database Zip-code, Age, and Income. A multi-dimensional indexing structure allows retrieval of data using several of these keys (dimensions) at the same time. A more general case is predicate search, which allows the program to specify an arbitrary predicate which, when invoked on the data, returns a true/false value. If the predicate yields true the data is returned to the user/application.

Information retrieval sciences are not that strict, and employ a scoring function which scores the data, returning the "best matches" ranked with the best match first. Web search engines, such as http://www.altavista.com/, employ various searching and scoring methods.
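Two of these retrieval flavors are easy to make concrete. The sketch below pairs a predicate search with a sound-invariant lookup; the normalization shown is a deliberately simplified stand-in for the real Soundex code, and the names and records are invented for the example.

```python
# A hedged sketch of predicate search and normalized-key lookup. The
# normalize() function is NOT the real Soundex algorithm; it only illustrates
# "map the key to a sound-invariant spelling, then do an exact lookup".

def normalize(name: str) -> str:
    dropped = set("aeiouyh")                       # crude sound-invariant form
    kept = [c for c in name.lower() if c not in dropped]
    collapsed = [c for i, c in enumerate(kept) if i == 0 or c != kept[i - 1]]
    return (name[0].upper() + "".join(collapsed[1:]))[:4]

people = ["Jon", "John", "Johnny", "Johnnie", "Maria"]
index = {}
for p in people:                                   # exact-match index on the
    index.setdefault(normalize(p), []).append(p)   # normalized key

print(index.get(normalize("John")))                # names that sound like John

# Predicate search: an arbitrary filter evaluated against every record.
records = [{"zip": 1012, "age": 30, "income": 40_000},
           {"zip": 1016, "age": 55, "income": 90_000}]
print([r for r in records if r["age"] > 40 and r["income"] > 50_000])
```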

1.4.2 Reasonable Properties

A reasonably well-behaved and efficient data structure can be expected to fulfill most of the following statements.

A data structure is a container that stores n items of data.

Each item is identified by its key(s) (algorithms typically assume that the keys are unique, but in practice this is often relaxed).

A single insert of an item should ideally be done in constant time, but typically O(log n) time is acceptable.

Lookups using unique keys are expected to be faster than inserts, or at least to exhibit the same cost.

Iteration over all items in the data structure takes O(n).

If the data structure supports ordered retrieval, i.e. previous and next operations, these are expected to take O(1) time. From this it follows that starting at the first item and applying the next operation on succeeding items until the last is reached should take O(n). Localization is then O(n log n) at worst.

Complex querying ("searching") of the data structure, returning r items, optimally takes O(r) time. The worst case, however, is O(n). An effective structure might perform pruning and achieve O(r + log n). The effectiveness of a search can be expressed by the overhead t/r, where t is the number of tests, together with the pruning factor (n - t)/n.

Most data structures are in practice limited by available memory, disk page sizes, etc. If the data structure can dynamically restructure itself and keep its performance, then we say it is dynamic. Then, theoretically, it has no upper limit on the number of items it can handle.

However, many dynamic structures deteriorate on skewed data, creating a structure where most of the data is stored in few places and where the insert and retrieval operations deteriorate in performance. The cause may be that the partitioning function used does not perform well, or that the input data appears in an inconvenient order. Some structures balance themselves to avoid this deterioration problem.

A non-dynamic structure can be replaced by another non-dynamic structure which can hold a larger data set and avoids the deterioration/skew experienced. This was the classic way to achieve dynamic structures (rebuilding; typical variants are Dynamic Array implementations and rehashing of hash-structures). One problem with this approach is that in some cases it might require the same time for re-inserting all the data stored in the previous structure, as well as double the amount of space for that time. We will refer to structures which gracefully accommodate these problems as being scalable. Examples include Linear Hashing [Lit80], Dynamic Hashing [Lar78], and the B-tree [BM72].

1.4.3 Basic Data Organization

The simplest and most common storage structure is the array, in most programming languages predetermined in the size and type of data it stores. The elements of an array are generally stored contiguously in memory and accessed using an implicitly calculated index. Retrieving data is easy when the position is known or when it can be calculated easily. When data can be retrieved by direct look-up in an array structure we call it radix retrieval or direct addressing.

When the data is stored in a sorted array, we can use binary search to locate the exact position of the data requested. It works by iteratively

Structure/Method   Key    Ordered   Insert      Lookup      Search   Memory
Array              1..n   -         1           1           O(n)     O(n)
Array              1..n   Y         O(n)        n/2         O(n)     O(n)
List               atom   -         1           O(n)        O(n)     O(n)
Hashing            atom   -         O(1)        O(1)        O(n)     O(n)
Tree               atom   -         O(log(n))   O(log(n))   O(n)     O(n * log(n))
Heap               id     -         O(1)        O(1)        -        O(n)

Table 1.1: Basic data structures, features and complexity.

halving the array, choosing the half which would contain the key searched for. This approach achieves O(log n) search time on uniform spaces. Even so, accesses might be slow when searching large volumes of data, because of the numerous comparisons to be made.

If the ordering of the keyed data is not important, one can employ a hash-structure. Hashing structures are commonly based on an array where, in each position in the array, called a slot, one item of data can be stored (closed hashing) [Knu]. The position of a record in the array is determined by a hash-function, which calculates a natural number from the key. The number is modified in such a way that it fits into the interval range of the array, effectively reducing the problem to radix retrieval.

When two items hash to the same location, a conflict resolution method is applied; it can involve rehashing using another hash-function or just stepping through the array systematically looking for an empty slot.
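The sketch below illustrates closed hashing with the simplest conflict resolution just mentioned, stepping linearly through the array until an empty slot is found. The table size and the keys are arbitrary choices made for the example.

```python
# A minimal sketch of closed hashing with linear probing as conflict
# resolution. The fixed table size is the limitation addressed later by
# dynamic schemes such as Linear Hashing; deletes and resizing are omitted.

SLOTS = 8
table = [None] * SLOTS           # each slot holds one (key, value) pair

def put(key, value):
    pos = hash(key) % SLOTS      # reduce the hash to the array's range
    for step in range(SLOTS):    # step through the array looking for a slot
        probe = (pos + step) % SLOTS
        if table[probe] is None or table[probe][0] == key:
            table[probe] = (key, value)
            return
    raise OverflowError("table full: a larger table must be built and refilled")

def get(key):
    pos = hash(key) % SLOTS
    for step in range(SLOTS):
        probe = (pos + step) % SLOTS
        if table[probe] is None:
            return None          # an empty slot ends the probe sequence
        if table[probe][0] == key:
            return table[probe][1]
    return None

put("Jonas", "Amsterdam")
put("Martin", "CWI")
print(get("Jonas"))              # -> Amsterdam
```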

1.4.4 Memory Management/Heaps

Data of varying sizes is typically stored and managed by employing a heap structure. Examples include varying-length strings, pictures, sound, and memo-fields in a DBMS. The heap structure manages an area of memory, which may be preallocated to the application. It keeps track of those areas of memory that are in use and those which are not. An area of memory is allocated to store the data. It will occupy somewhat more memory (storage overhead). When the data is not needed anymore, it can be given back to the heap by deallocating/freeing it, for later reuse.

Heaps for main memory management experience problems with heap fragmentation, where allocated memory blocks of varying sizes are not stored in adjacent memory locations, i.e., a situation where a request for memory cannot be granted because there is no single contiguous block of memory available. This problem is traditionally attacked by Garbage Collection methods, which compact the used memory, removing the "holes" of unused memory. Other methods employ more elaborate allocation schemas.
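The bookkeeping such a heap performs can be sketched as a free list of holes inside one preallocated area. The first-fit policy, the header size and the absence of hole coalescing are illustrative assumptions; they also make the fragmentation problem described above easy to observe.

```python
# A toy free-list heap over one preallocated area. The per-block header is
# the "storage overhead" mentioned above; freeing without coalescing leaves
# the holes that cause fragmentation in real allocators.

HEAP_SIZE = 1024
free_list = [(0, HEAP_SIZE)]      # (offset, size) holes; initially one big one
HEADER = 8                        # bookkeeping bytes per allocation

def allocate(size):
    need = size + HEADER
    for i, (off, hole) in enumerate(free_list):
        if hole >= need:                            # first fit
            free_list[i] = (off + need, hole - need)
            if free_list[i][1] == 0:
                free_list.pop(i)
            return off                              # "handle" to the block
    raise MemoryError("no single contiguous block large enough")

def free(offset, size):
    free_list.append((offset, size + HEADER))       # no coalescing of holes

h1 = allocate(100)
h2 = allocate(200)
free(h1, 100)
print(free_list)    # the freed hole plus the remaining tail of the heap
```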

1.4.5 Linked Lists

The drawback of array storage is that its size is often predefined at compile time. When more data is stored the array overflows, and programs may malfunction or even crash with disastrous effects. Therefore dynamic allocation of memory and data storage is essential to match the current needs of the application. A linked list can be viewed as a number of items linked together into a chain by storing additional information, pointers, which for each link point to the "next" item. Searching for a key in a long list is "slow", since it, in the basic configuration, requires scanning the list from the beginning until the matching data is found. However, a list has the advantage that items can easily be added or removed without having to move other items. The main advantage of a linked list implementation is that inserts do not need to move any data.

1.4.6 Chained/Closed Hashing

Linked lists are often used in combination with hashing, to allow every radix position (slot in the array) to store several data items, linked together into a linked list. Every such slot is then said to point to a bucket. Another possibility for implementing a bucket is to associate a memory segment/disk page with a slot in the hashing array. Hashing is analyzed in [Knu].

Still, if the volume of the inserted data is very large, much larger than foreseen for the hashing array, the problem again deteriorates to searching a linked list, since too many items are stored per slot. Normally, the average number of items stored per slot is limited. When this limit is exceeded, a new larger array is created and all the items are moved (inserted) into the new hash structure.

1.4.7 Trees

Tree structures allow efficient inserts, deletes and retrievals. A tree contains two types of nodes: branch-nodes, which are used for organizing the data, and leaf-nodes, which store the data. A branch-node is a choice-point, where a choice is made between a number of branches. In the simplest case, where each node has two branches, a node can be characterized by a value. Data with a lower value is stored in the sub-tree found by following the left-hand branch; higher values are found through the right-hand branch in the same way. The root-node is the "first" node of the tree. By navigating, starting at the root node, and traversing branch-nodes, a leaf-node is eventually reached.

The best performing tree, on average, under uniform access distributions is a balanced tree, in which all leaf-nodes are at the same distance from the root-node. Such a tree allows inserts and lookups in O(log n) time. In a balanced tree, all the branches (including their reachable subtrees) of a node have the same weight. The "weight" is the number of items that are reachable through the branch. The worst case scenario for a non-balancing tree is where one of the branches recursively has the largest weight. In such a skewed tree, for example a binary tree, each branch-node would have one leaf-node and another branch-node. The distance to the root for the last leaf-node is then linear in the number of items, giving the search time O(n). However, with randomly ordered input data, it is highly unlikely that the tree deteriorates this much, and on average the navigation is said to be O(log n).

One reason that non-balancing trees have such a bad worst case performance is that they do not dynamically adjust the subtrees of the nodes. Instead, the node's split criterion is fixed when the node is created.

The AVL-tree is a binary balancing tree. It dynamically adjusts the branch nodes, so that all leaves are kept at roughly the same distance from the root. B-trees are another example of a balancing tree, but they allow tuning of the number of branches per node. B-trees are popular for disk-based storage and DBMSs.

Quad-trees [Sam89] allow for spatial data, i.e. data points with a coordinate (x, y). There are several variants of Quad-trees, some allowing a dynamic choice of split-value for a branch-node, and others where this value is predetermined. The dynamic choice can lead to highly skewed trees, depending on insertion order, whereas the predetermined variant may create unnecessarily deep sub-trees for highly clustered data.

Trees that are invariant of the insert order, i.e., independent of the insert order of the items, always yield a tree with the same structure. One example is a quad-tree using bits for its organization [Sam89]. In such a tree the splitting criteria are predetermined. Each node's splitting criterion solely depends on its position in the tree (height and nodes above). These trees can experience problems with clustered data yielding deep branches; these can, however, be efficiently compacted [Sam89].

1.4.8 Signature Files

A signature file [FBY92] works as an inexact filter. It is mainly used in Information Retrieval to index words in documents, but can be applied successfully to other data items too, such as time-series data [Jön99]. For each document a number of signatures are stored in a file; each signature is viewed as an array of bits of fixed size. Typically, each relevant word in the document is hashed to a bit in the signature, setting this bit to 1, but other coding schemes can also be used. A signature is often chosen so that approximately half of the bits are set. To search for a word, all signatures are scanned. For signatures that match, it is probable that the corresponding document contains the word searched for. The document can then be retrieved and tested. Some signatures match even though the document does not contain the word searched for; these matches are called "false hits". A signature typically consists of hash-coded bit patterns. Scanning of signature files is much faster than scanning the actual documents. Still, it is only a few orders of magnitude faster. The space overhead is typically chosen to be 10% to 15%.
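The filter behaviour is easy to demonstrate with one signature per document, one hashed bit per word, and a verification step for false hits. The 64-bit signature width and the example documents are assumptions; real signature files use wider signatures and richer coding schemes.

```python
# A hedged sketch of a signature file used as an inexact filter: each word
# sets one bit, a query tests its bit against every signature, and the
# candidate documents are verified to weed out false hits.

BITS = 64

def signature(words):
    sig = 0
    for w in words:                        # set one bit per word
        sig |= 1 << (hash(w) % BITS)
    return sig

docs = {"d1": "linear hashing scales to many servers",
        "d2": "quad trees index spatial data"}
sigs = {name: signature(text.split()) for name, text in docs.items()}

def search(word):
    probe = signature([word])
    candidates = [n for n, s in sigs.items() if s & probe == probe]
    return [n for n in candidates if word in docs[n].split()]   # verify hits

print(search("hashing"))                   # -> ['d1']
```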

Structure/Method   Positive                                   Negative
Array              + compact                                  - ordered inserts O(n)
                                                              - fixed size
                                                              - slack = O(N - n)
Dynamic Array      + compact                                  - indirection overhead
                   + dynamic
                   + limited slack
Memory Heap        + dynamic sizes of data                    - O(n) overhead
                   + O(1) store/retrieval using "handle"      - deteriorates with usage
                                                              - no search possible
Linked List        + dynamic                                  - O(n) overhead
                   + insert in O(1) time                      - non compact
                                                              - O(n) searches
Open Hashing       + "O(1)" access time                       - fixed size
                                                              - complex collision handling
Closed Hashing     + "O(1)" access time                       - fixed size
                                                              - buckets with slack/linked list/array
Dynamic Hashing    + "O(1)" search time                       - accumulated O(log n) insert time
(LH)                                                          - complex implementation
Tree               + dynamic                                  - skew gives O(n) in worst case
                                                              - storage overhead n + n/2
                                                              - insert order sensitive
Balanced Tree      + dynamic                                  - rearranging cost (dyn array)
                   + guaranteed O(log n) retrieval
Signature Files    + fast "O(1)" insert                       - slow O(n) search time
                   + approximate                              - no order
                                                              - storage overhead n

Table 1.2: Positive and negative properties of basic data structures.

1.5 Roundup

Table 1.2 displays a list of data structures and what I see as their most positive and negative properties. The list is in no way complete, but it is provided as a summary of the discussions in this chapter.

1.6 LH* (1 dimensional data)

We start out by describing LH* [LNS93], the first full SDDS designed. LH* defined the basics for SDDSs and inspired me and many others to boldly create and explore areas where no man has gone before.

We will now describe the LH* SDDS, and later on we describe LH*LH. LH* is a data structure that generalizes Linear Hashing (LH) to parallel or distributed RAM and disk files [LNS96]. One benefit of LH* over ordinary LH is that it enables autonomous parallel insertion and access. The number of buckets, and the buckets themselves, can grow gracefully. Insertion requires one message in general and three in the worst case. Retrieval requires at least two messages, possibly three or four. In experiments it has been shown that insertion performance is very close to one message (+3%) and that retrieval performance is very close to two messages (+1%). The main advantage is that no central directory is required for managing the global parameters.

1.6.1 LH* Addressing Scheme

An LH*-client is a process that accesses an LH* file on behalf of the application. An LH*-server at a node stores data of LH* files. An application can use several clients to explore a file. This way of processing increases the throughput, as will be shown in Section 2.6. Both clients and servers can be created dynamically.

Figure 1.1: LH* File Expansion Scheme. (The original diagram shows a data client, with its image of the file level and split pointer, issuing inserts to the numbered data servers 0-8; the arrows mark hash-function addressing, forwarding between servers, and the IAM returned to the client.)

At a server, one bucket per LH* file contains the stored data. The bucket management is described in Section 2.2. The file starts at one server and expands to others when it overloads the buckets already in use.

The global addressing rule of an LH* file is that every key C is inserted at the server s_C, whose address s = 0, 1, ..., N-1 is given by the following LH addressing algorithm [Lit94]:

s_C := h_i(C);
if s_C < n then s_C := h_{i+1}(C),

where i (the LH* file level) and n (the split pointer address) are file parameters evolving with splits. The h_i functions are basically:

h_i(C) = C mod (2^i x K), K = 1, 2, ...

and K = 1 in what follows. No client of an LH* file knows the current i and n of the file. Every client has its own image of these values; let them be i' and n', where typically i' ≤ i [LNS93]. The client sends the query, for example the

insert of key C, to the address s'_C(i', n').

The server s'_C verifies upon query reception whether its own address s'_C equals s_C, using a short algorithm stated in [LNS93]. If so, the server processes the query. Otherwise, it calculates a forwarding address s''_C using the forwarding algorithm in [LNS93] and sends the query to server s''_C. Server s''_C acts as s'_C and perhaps resends the query to server s'''_C, as shown for Server 1 in Figure 1.1. It is proven in [LNS93] that then s'''_C must be the correct server. In every case of forwarding, the correct server sends to the client an Image Adjustment Message (IAM) containing the level i of the correct server. Knowing this i and the s_C address, the client adjusts its i' and n' (see [LNS93]) and from now on will send C directly to s_C.
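The client-side address computation follows directly from the formulas above; the sketch below renders it in a few lines. The image-adjustment rule shown is a simplified rendering of the algorithm referenced from [LNS93], not a verbatim reproduction, and K = 1 is assumed as in the text.

```python
# A sketch of the LH/LH* addressing rules given above, with K = 1. The
# adjust_image() rule is a simplified stand-in for the IAM algorithm cited
# from [LNS93]; it only shows how an IAM moves the image (i', n') forward.

def h(i, key):                          # h_i(C) = C mod 2^i
    return key % (2 ** i)

def client_address(key, i_img, n_img):
    """Address s'_C computed from the client's image (i', n')."""
    a = h(i_img, key)
    if a < n_img:                       # that bucket has already split again
        a = h(i_img + 1, key)
    return a

def server_is_correct(my_addr, my_level, key):
    """Server-side check: is this bucket the correct one for the key?"""
    return h(my_level, key) == my_addr

def adjust_image(iam_level, server_addr):
    """Simplified image adjustment on receipt of an IAM (assumption)."""
    i_img, n_img = iam_level - 1, server_addr + 1
    if n_img >= 2 ** i_img:
        i_img, n_img = i_img + 1, 0
    return i_img, n_img
```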

1.6.2 LH* File Expansion

An LH* file expands through bucket splits, as shown in Figure 1.1. The next bucket to split is generally noted bucket n, with n = 0 in the figure. Each bucket keeps the value of i used (called the LH*-bucket level) in its header, starting from i = 0 for bucket 0 when the file is created. Bucket n splits through the replacement of h_i with h_{i+1} for every C it contains. As a result, typically half of its records move to a new bucket N, appended to the file with address n + 2^i. In Figure 1.1, N = 8. After the split, n is set to (n + 1) mod 2^i. The successive values of n can thus be seen as a linear move of a split token through the addresses 0, 0, 1, 0, 1, 2, 3, 0, ..., 2^i - 1, 0, ... The arrows of Figure 1.1 show both the token moves and the new bucket address for every split, as resulting from this scheme.
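A single expansion step, splitting bucket n with h_{i+1} and advancing the split token, can be sketched as follows. Representing buckets as in-memory dictionaries is an assumption made for illustration; in LH* the buckets live on different servers.

```python
# One LH* file-expansion step as described above: bucket n is rehashed with
# h_{i+1}, roughly half of its keys move to the new bucket n + 2^i, and the
# split token n advances, bumping the level i after a full round.

def h(i, key):
    return key % (2 ** i)

def split(buckets, n, i):
    """Split bucket n of a file at level i; return the new (n, i)."""
    new_addr = n + 2 ** i
    buckets[new_addr] = {}
    for key in list(buckets[n]):
        if h(i + 1, key) == new_addr:          # rehash with h_{i+1}
            buckets[new_addr][key] = buckets[n].pop(key)
    n = (n + 1) % (2 ** i)                     # move the split token
    if n == 0:
        i += 1                                 # a completed round doubles the range
    return n, i

buckets = {0: {k: None for k in range(16)}}    # everything starts in bucket 0
n, i = 0, 0
n, i = split(buckets, n, i)                    # creates bucket 1
n, i = split(buckets, n, i)                    # creates bucket 2
print(sorted(buckets), n, i)                   # -> [0, 1, 2] 1 1
```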

Splitting Control Strategies

There are many strategies, called split control strategies, that one can use to decide when a bucket should split [LNS96] [Lit94] [WBW94]. The overall goal is to avoid overloading the file. As no LH* bucket can know the global load, one way to proceed is to fix some threshold S on a bucket [LNS96]. Bucket n splits when it gets an insert and the actual number of objects it stores is at least S. S can be fixed as a file parameter, but a potentially better performing strategy is to calculate S for bucket n dynamically using the following formula:

S = M x V x (2^i + n) / 2^i,

where i is the n-th LH*-bucket level, M is a file parameter, and V is the bucket capacity in number of objects. Typically one sets M to some value between 0.7 and 0.9.

The intuition behind the formula is as follows. A split to a new server should occur for every M x V global inserts into the data structure, thus aiming at keeping the mean load of the buckets constant:

global number of inserts / number of servers = constant.

A server without any knowledge about the other servers can only use its own information, that is, its bucket number n and the level i, to estimate the global load. It knows that any server < n, i.e. servers 0..n-1, has split into servers 2^i..2^i+n-1, and both of these groups thus have half the load of the servers that are not yet split, servers n..2^i-1. The number of servers can be calculated to be 2^i + n, which gives us an estimated global load of

M x V x (2^i + n).

Servers that were split, and new servers, have half the load, S/2, of those that are yet to split, which have the load S. The n new servers come from n servers, totalling 2 x n servers with the load S/2, and 2^i + n - 2 x n remaining servers to be split later with a load of S. The total over these servers can then be expressed as (1/2) x S x 2 x n + S x (2^i - n). This can be simplified to S x 2^i. Setting the global estimate equal to the last expression provides, after some simplification,

M x V x (2^i + n) = S x 2^i.

Solving for S gives the formula for S expressed above.

The performance analysis in Section 2.6.1 indeed shows that the dynamic strategy should be preferred in our context. This is the strategy adopted for LH*LH.
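The dynamic threshold is a one-line computation. The snippet below evaluates it for a few (i, n) states to show how the allowed bucket load rises as the split token advances within a round and falls back once the level increases; M = 0.8 and V = 1000 are example values, not taken from the thesis.

```python
# Evaluating the dynamic split threshold S = M * V * (2^i + n) / 2^i for a
# few file states. M and V are example values in the range suggested above.

M, V = 0.8, 1000                 # load target and bucket capacity (examples)

def threshold(i, n):
    return M * V * (2 ** i + n) / 2 ** i

for i, n in [(3, 0), (3, 4), (3, 7), (4, 0)]:
    print(f"i={i}, n={n}: split at about {threshold(i, n):.0f} objects")
```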

1.6.33 Conclusion LHH is well-known for its scalability in handling a dynamic growing dataset andd the new distributed LH* is also proven scalable. Both of these hashing algorithmss use the actual bit representation of the hash values; these are givenn by the keys. Hashing in general can be seen as a radix sort in an intervall where each value has a bucket where it stores the items. LH can in turnn be viewed as a radix sort using the lower bits of the hash value for the keys.. It furthermore has an extra attribute that tells us the number of bits used,, and a splitting pointer. The splitting pointer allows gradual growth andd shrinkage of the range of values (number of buckets) used for the radix sort. . LH** is a variant of LH that enables simultaneous access from several clientss to data stored on several server nodes. One LH bucket corresponds 1.7.1.7. ORTHOGONAL ASPECTS 33 3 too the data stored on a server node. In spite of not having a central direc- tory,, the LH* algorithm allows for extremely fast update of the client's view soo that it will access the right server nodes when inserting and retrieving data.. LH* [LNS93] was one of the first Scalable Distributed Data Structure (SDDSs).. It generalizes LH [Lit80] to files distributed over any number of sites.. One benefit of LH* over LH is that it enables autonomous parallel insertionn and access. Whereas the number of buckets in LH changes grace- fully,, LH* lets the number of distribution sites change as gracefully. Any numberr of clients can be used; the access network is the only limitation for linearr scaleup of the capacity with the number of servers, for hashed access. Inn general, insertion requires one message, and in the worst case three mes- sagess might occur. Retrieval requires one more message. But the main issue iss that no central directory is needed for access to the data.

1.7 Orthogonal Aspects

In this section we list important properties for the data structures studied in this thesis. These properties should ultimately be independently available for data storage. In practice this is not the case. For example, distribution or parallelism gives better performance but generally decreases the availability. More dimensions give more overhead and/or worse performance.

1.7.1 Performance

It is desirable that single lookup/insert operations can be performed in "constant time"; in practice, however, O(log n) usually suffices. When one or more parameters are varied for a data structure, such as dimensions, distribution, availability, or communication topologies, they will inevitably affect performance. Disk I/O, as well as cache-misses in RAM, should be avoided. In some cases the actual CPU cycles may be of importance in common operations, such as scanning arrays of data.

1.7.2 Dimensions

For classic data structures, only one-dimensional data is allowed. That is, one key is used for retrieval and inserts.

Two dimensions are also fairly well covered by the literature. Many structures combine the x and y values into one value, and use this value for indexing in a classic one-dimensional data structure. Some structures are based on order-preserving hashing, interleaving the binary representations of x and y to form another value that is later used for indexing a one-dimensional hash file (referred to as multi-dimensional hashing), or for building quad-tree style structures. Common operations on spatial (2-dimensional) data structures involve point lookup, region retrieval, closest neighbors, or similarity retrieval [Sam89].

It is a well-known fact that most multi-dimensional data structures suffer from the multi-dimensional curse [WSB98]: the performance degrades by many orders of magnitude when the dimensionality increases. For similarity retrieval it has been observed [WSB98] that it is better to perform scanning over the whole dataset, or to use a compact signature file, than to try to use a multi-dimensional data structure. It was shown that their scanning method outperformed known efficient multi-dimensional data structures already at 13 dimensions.
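As an illustration of the interleaving mentioned above (a generic sketch, not taken from any particular system), two 16-bit coordinates can be merged bit by bit into one value that is then used as the key of an ordinary one-dimensional hash file:

    #include <stdint.h>

    /* Interleave the bits of x and y (bit-level z-order): bit k of x
     * goes to bit 2k, and bit k of y to bit 2k+1, of the result.      */
    static uint32_t interleave_xy(uint16_t x, uint16_t y)
    {
        uint32_t z = 0;
        for (unsigned k = 0; k < 16; k++) {
            z |= (uint32_t)((x >> k) & 1u) << (2 * k);
            z |= (uint32_t)((y >> k) & 1u) << (2 * k + 1);
        }
        return z;   /* usable as a key in a one-dimensional hash file */
    }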

1.7.3 Overhead

Data structures allow for efficient indexing, in order to accelerate retrieval. However, the storage overhead is not negligible in most indices: B-trees may store duplicate keys in the internal nodes, and hashing structures often have some slack to avoid worst cases. The storage overhead (unused space) is typically around 50% for a B-tree and 10-15% for a hash structure. Performance is affected even more by the CPU usage for navigating the index, or by the cost of calculating hash keys. The compactness and locality of the in-memory navigation is another important concern on CPUs with large internal caches [BMK99].

1.7.4 Distribution and Parallelism

To handle very large amounts of data, distribution or parallelism is traditionally employed. Distribution of data, however, adds additional storage overhead, often more calculations, as well as communication messaging. Parallelism using shared memory is limited by the scale-up limits of the hardware architecture, but it avoids costly messaging by using other means of synchronization. Storing more data at more sites incurs more messages and more overhead in accessing and organizing the data, as well as in processing it.

B-trees and hash structures are preferred for creating distributed indices. They can be used to administer and automatically decluster the data set over a number of nodes. However, they are often static in their structure, allowing only limited load balancing, and they perform poorly when the presumptions change.

SDDSs are, in a way, the "balancing" distributed data structures. They generally allow for retrievals in "near constant" time, O(1)...O(log n), or rather a "near constant" number of messages on average. The performance of SDDSs is often assessed by simulations that count the number of serial messages needed for data to be found or inserted. For example, LH* [LNS93] has been reported to allow retrieval of data distributed over hundreds of nodes in less than 2.001 messages on average. Furthermore, LH* limits the number of messages needed for retrieval to 4. Other SDDSs do not guarantee an upper bound, but instead offer acceptable average performance.

1.7.5 Availability

Availability means that data can be made available when it is needed. It may involve reconstruction of the actual data by applying logs, or by combining partial replicas.

In real-time systems, data structures are designed to give guaranteed performance, bounded both in time and in availability. Many "dynamic" structures are more vague, quoting average performance values. Disk-based data may be cached in main memory or require random disk accesses, and may be delayed because other users access the same disk.

In distributed systems, the task is even more difficult. Storage nodes may be unavailable at times, because of hardware or software faults, or network congestion. Some specialized networks are designed to be able to give promises about the performance (e.g., ATM). Common solutions for achieving high availability use techniques such as RAID storage [PGK88], replication, logging and hot standby, and failure recovery [Tor95].

Chapter 2

The LH*LH Algorithm

The chapter is based on material from the article on LH*LH [KLR96], published in the lecture notes from the Fifth International Conference on Extending Database Technology, in Avignon, France, 1996. It was extended and published as part of the Licentiate Thesis [Kar97].

Details of the design and implementation of LH*LH are provided. The prototype is targeted at the Parsytec machine (Section 2.5). However, the LH*LH implementation can be used on any threaded (network) multi-computer.

2.1 Introduction

We present the LH*LH design and performance. With respect to LH* [LNS93], LH*LH is characterized by several original features. Its overall architecture is geared towards an SM (Switched Multi-computer), while that of LH* was designed for a network multi-computer. Furthermore, the design of LH*LH involves local bucket management, while in [LNS93] this aspect of the LH* design was left for further study. For this purpose LH*LH uses a modified version of main-memory Linear Hashing as defined in [Pet93] on the basis of [Lar88] and LH [Lit94]. An interesting interaction between LH and LH* appears, allowing for much more efficient LH* bucket splitting. The reason is that LH*LH allows the splitting of LH*-buckets without visiting individual keys.

The average access time is of primary importance for any SDDS on a network computer or an SM. Minimizing the worst case is, however, probably more important for an SM, where processors work more tightly connected than in a network computer. The worst case for LH* occurs when a client accesses a bucket undergoing a split. LH* splits should be infrequent in practice since buckets should be rather large. In the basic LH* schema, a client's request simply waits at the server till the split ends. In the Parsytec context, performance measurements show that this approach may easily lead to several seconds per split, e.g. three to seven seconds in our experience (as compared to 1-2 msec per request on the average). Such a variance would be detrimental to many SM applications.

LH*LH is therefore provided with an enhanced splitting schema, termed concurrent splitting. It is based on ideas sketched in [LNS96], allowing for the client's request to be dealt with while the split is in progress. Several concurrent splitting schemes were designed and experimented with. Our performance studies show the superiority of one of these schemes, termed concurrent splitting with bulk shipping. The maximal response time of an insert while a split occurs decreases by a factor of three hundred to a thousand. As we report in what follows, it becomes about 7 msec for one active client in our experience, and 25 msec for a file in use by eight clients. The latter value is due to interference among clients requesting simultaneous access to the splitting server.

The first implementation of LH* was performed using the Parallel Virtual Machine software, PVM [MSP93], on a number of HP workstations. The reason was mainly that the Parsytec machine in Linköping was at that moment newly installed and quite unstable, and thus unavailable most of the time. Later, this partly influenced the implementation in such a way that library primitives dealing with hardware or environment specifics have been abstracted in an almost transparent way.

LH*LH allows for scalable RAM files spanning several CPUs of an SM and their RAMs. On our testbed machine, a Parsytec GC/PowerPlus with 64 nodes of 32 MB RAM each, a RAM file can scale up to almost 2 GBytes with an average load factor of 70%. A file may be created and searched by several (client) CPUs concurrently. The access times may be about as fast as the communication network allows them to be. On our testbed, the average time per insert is as low as 1.2 ms per client. Eight clients building a file concurrently reach a throughput of 2500 inserts/second, i.e., 400 µs/insert. These access times are more than an order of magnitude better than the best ones using current disk file technology, and will probably never be reached by mechanical devices.

2.2 The Server

The server consists of two layers, as shown in Figure 2.1a. The LH*-Manager handles communications and concurrent splits. The LH-Manager manages the objects in the bucket. It uses the Linear Hashing algorithm [Lit80].

2.2.1 The LH Manager

LH creates files able to grow and shrink gracefully on a site. In our implementation, the LH-manager stores all data in main memory. The LH variant used is a modified implementation of Main Memory Linear Hashing [Pet93].

Figure 2.1: The Data Server.

The LH file in an LH*-bucket (Figure 2.2b) essentially contains (i) a header with the LH-level, an LH-splitting pointer, and the count x of objects stored, (ii) a dynamic array of pointers to LH-buckets, and (iii) LH-buckets with records. An LH-bucket is implemented as a linked list of the records. Each record contains the calculated hash value (pseudo-key), a pointer to the key, and a pointer to a BLOB. Pseudo-keys make the rehashing faster. An LH-bucket split occurs when L >= 1, with

L = x / (b x m),

where b is the number of buckets in the LH file, and m is a file parameter giving the required mean number of objects in the LH-buckets (linked lists). Linear search is most efficient up to an m of about 10.
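A minimal sketch of this layout, with assumed field names (the actual implementation is not reproduced here):

    #include <stddef.h>
    #include <stdint.h>

    /* One record: pseudo-key plus pointers to the key and the BLOB.   */
    typedef struct lh_record {
        uint32_t          pseudo_key;   /* cached hash value            */
        char             *key;
        void             *blob;
        size_t            blob_size;
        struct lh_record *next;         /* linked list in one LH-bucket */
    } lh_record;

    /* The LH file stored inside one LH*-bucket.                        */
    typedef struct lh_file {
        unsigned    level;              /* LH-level (bits used)         */
        uint32_t    split_ptr;          /* LH splitting pointer         */
        uint32_t    count;              /* x: objects stored            */
        uint32_t    num_buckets;        /* b: current LH-buckets        */
        uint32_t    mean_per_bucket;    /* m: desired objects/bucket    */
        lh_record **buckets;            /* dynamic array of bucket heads*/
    } lh_file;

    /* Split condition L >= 1 with L = x / (b * m).                     */
    static int lh_should_split(const lh_file *f)
    {
        return f->count >= f->num_buckets * f->mean_per_bucket;
    }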

2.2.2 LH* Partitioning of an LH File

The use of LH allows the LH* splitting to be done in a particularly efficient way. The reason is that individual records of the buckets are not visited for rehashing. Figure 2.3 and Figure 2.4 illustrate the ideas. LH and LH* share the pseudo-key. The pseudo-key has J bits, as shown in Figure 2.3; J = 32 at every bucket. LH* uses the lower l bits (b_{l-1}, b_{l-2}, ..., b_0).

LH uses j bits (b_{l+j-1}, b_{l+j-2}, ..., b_l), where j + l <= J. During an LH*-split, l increases by one, whereas j decreases by one. The value of the new lth bit determines whether an LH-bucket is to be shipped. Only the odd LH-buckets, i.e. those with b_l = 1, are shipped to the new LH*-bucket N. The array of the remaining LH-buckets is compacted, the count of objects is adjusted, the LH-bucket level is decreased by one (LH uses one bit less), and the split pointer is halved. Figure 2.4 illustrates this process.

Figure 2.2: The LH-structure.

Figure 2.3: Pseudo-key usage by LH and LH*.

Figure 2.4: Partitioning of an LH-file by LH* splitting.

Further inserts to the bucket may lead to any number of new LH splits, increasing j as shown in Figure 2.3 to some j'. The next LH* split of the bucket will then decrease j' to j' := j' - 1, and set l := l + 1 again.
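The partitioning rule can be sketched as follows (assumed names and simplified bookkeeping, not the thesis code; shipping itself is covered in Section 2.2.4). The bit that moves from LH to LH* is the lowest bit of the local LH-bucket index, so the odd LH-buckets are shipped as a whole and the even ones are compacted:

    #include <stdint.h>

    struct lh_record;                      /* opaque linked list of records */
    /* Hypothetical: packs and sends the records of one LH-bucket to N.    */
    extern void ship_bucket(struct lh_record *head, int new_server_N);

    static void lhstar_split_local(struct lh_record **buckets,
                                   uint32_t *num_buckets,
                                   unsigned *lh_level, uint32_t *split_ptr,
                                   int new_server_N)
    {
        uint32_t b = *num_buckets;
        for (uint32_t i = 0; i < b; i++) {
            if (i & 1u)
                ship_bucket(buckets[i], new_server_N); /* odd: ship whole */
            else
                buckets[i / 2] = buckets[i];           /* even: compact   */
        }
        *num_buckets = b / 2;   /* half of the LH-buckets remain          */
        *lh_level   -= 1;       /* LH now uses one bit less               */
        *split_ptr  /= 2;       /* split pointer is halved                */
        /* The object count would be reduced by the number shipped.       */
    }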

2.2.3 Concurrent Request Processing and Splitting

A split is a much longer operation than a search or an insert. The split should also be atomic for the clients. Basic LH* [LNS93] simply requires the client to wait till the split finishes. For high-performance applications on an SM multi-computer it is fundamental that the server processes a split concurrently with searches and inserts. This is achieved as follows in LH*LH. Requests received by a server undergoing a split are processed as if the server had not started splitting, with one exception: a request that concerns parts of the local LH structure already shipped is queued to be processed by the Splitter. The Splitter processes the queue of requests, since these requests concern LH-buckets of objects that have been or are being shipped. If the request concerns an LH-bucket that has already been shipped, the request is forwarded. If the request concerns an LH-bucket not yet shipped, it is processed in the local LH table as usual. Requests that concern the LH-bucket that is currently being shipped first search the remaining objects in the LH-bucket; if not found there, they are forwarded by the Splitter. All forwardings are serialized within the Splitter task. More detailed information about the algorithm and the possible implementation choices is given in Section 2.4.3.
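The routing rule during a concurrent split can be sketched as follows (the helper functions are assumptions; the key-uniqueness subtleties discussed in Section 2.4.3 are omitted):

    #include <stdint.h>

    struct request;                                    /* opaque request       */
    extern uint32_t lh_bucket_of(const struct request *r);
    extern int  already_shipped(uint32_t lh_bucket);   /* left this server     */
    extern int  being_shipped(uint32_t lh_bucket);     /* currently in transit */
    extern void process_locally(struct request *r);
    extern void enqueue_for_splitter(struct request *r);

    /* Requests touching LH-buckets that have been (or are being) shipped
     * are queued for the Splitter, which either searches the remaining
     * objects or forwards the request to the new server; everything else
     * is handled as if no split were in progress.                         */
    static void dispatch_during_split(struct request *r)
    {
        uint32_t b = lh_bucket_of(r);
        if (already_shipped(b) || being_shipped(b))
            enqueue_for_splitter(r);
        else
            process_locally(r);
    }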

2.2.4 Shipping

Shipping means transferring the objects selected during the LH*-bucket split to the newly appended bucket N. In LH* [LNS96] the shipping was assumed basically to be of the bulk type, with all the objects packed into a single message. After shipping has been completed, bucket N sends back a commit message. In LH*LH there is no need for the commit message. The Parsytec communication is safe, and the sender's data cannot be updated before the shipment has been entirely received. In particular, no client can directly access bucket N before the split is complete.

In the LH*LH environment there are several reasons for not shipping too many objects in a message, especially not all the objects in a single message. Packing and unpacking objects into a message require CPU time and memory transfers, as objects are not stored contiguously in memory. One also needs buffers of sizes at least proportional to the message size, and a longer occupation of the communication subsystem. Sending objects individually simplifies these aspects but generates more messages and more overhead time in the dialog with the communication subsystem. It does not seem that one can easily decide which strategy is finally more effective in practice.

The performance analysis in Section 2.6.2 motivated the corresponding design choice for LH*LH. The approach is that of bulk shipping, but with a limited message size. At least one object is shipped per message, and at most one LH-bucket. The message size is a parameter allowing for an application-dependent packing factor. Tests using bulks of a dozen records per shipment showed this to be much more effective than individual shipping.
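A sketch of the bulk-shipping loop under this policy (hypothetical names; a real implementation would also precede each object with its key and length so that the receiver can unpack the bulk):

    #include <stddef.h>
    #include <string.h>

    struct record { const void *bytes; size_t size; struct record *next; };

    /* Hypothetical transport call: sends one packed message to server N. */
    extern void send_bulk(int new_server_N, const char *buf, size_t len);

    /* Pack the records of one LH-bucket into messages of at most max_msg
     * bytes; every message carries at least one object, and no message
     * mixes objects from different LH-buckets.                           */
    static void ship_lh_bucket(const struct record *head, int new_server_N,
                               size_t max_msg)
    {
        char   buf[64 * 1024];                  /* illustrative buffer    */
        size_t used = 0;
        if (max_msg > sizeof buf) max_msg = sizeof buf;

        for (const struct record *r = head; r != NULL; r = r->next) {
            if (used > 0 && used + r->size > max_msg) {
                send_bulk(new_server_N, buf, used);     /* flush full bulk */
                used = 0;
            }
            if (r->size > max_msg) {                    /* oversized object */
                send_bulk(new_server_N, (const char *)r->bytes, r->size);
                continue;
            }
            memcpy(buf + used, r->bytes, r->size);
            used += r->size;
        }
        if (used > 0)
            send_bulk(new_server_N, buf, used);         /* last partial bulk */
    }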

2.3 Notes on LH*LH Communications

In the LH*LH implementation on the Parsytec machine, a server receiving a request must have issued the receive call before the client can do any further processing. This well-known rendezvous technique enforces entry flow control on the servers, preventing the clients from working much faster than the server can accept requests.^1 There is no need for the insert operations to provide a specific acknowledgement message, since communication is "safe" on the Parsytec machine. IAMs, split messages with the split token, and general service messages use the asynchronous type of communication to remove the possibility of deadlocks. We avoid deadlocking by never letting a server communicate synchronously with a server having a lower logical number. When a data client requests data from a data server, it must receive the answer directly, without engaging in any other communication, since the server would otherwise be blocked. This is enforced in the data client interface by making look-up operations atomic.

The choice of synchronous communication for normal communication (that is, not for IAMs and similar control messages) does not, however, mean that the requests on the data-structure client must be synchronous. That is, when a client issues an insert operation, the client only waits till the message has been delivered to the server; then both the server and the client can continue processing. Presumably the server executes the insert operation internally, but the client does not wait for any acknowledgement of the completion of the operation.

^1 An overloaded server can run out of memory space and then send outdated IAMs; this happened when using PVM [MSP93] together with many active clients on a few servers. There is no flow control when sending. Messages were stored internally by PVM, and the receiving process eventually grew out of memory. This indicates the need for data flow control.

2.3.1 Communication Patterns

A parallel machine has a communication topology that is inherited from the hardware interconnect. For example, if a number of nodes are interconnected in a ring communicating only in one direction, it is obvious that problems that communicate in the same manner benefit from such a topology. In this case, a program that performs pipe-lined execution would suit well, but a program where all nodes communicate with the other nodes in a star pattern would lose. This means that if a program communicates in such a way that it can directly use the underlying topology, minimizing the number of hops the messages have to traverse, it will be efficient. Many general-purpose parallel machines try to mimic different topologies by imposing a virtual topology onto the real topology. In the case of a grid-type (mesh) interconnected multi-computer, a Parsytec for instance, the pipe topology can easily be implemented by a clever allocation of the nodes that minimizes the communication hops. Other types of topologies [LER92] include the Star Network, Toroidal Mesh, Tree Network, Ring Network, Fully Connected Network, and N-Dimensional Hypercubes.

The question in our case is whether LH*LH would benefit from any such topology; the answer depends very much on what kind of operations dominates its use. If a known number of clients access data at a number of servers, they could be placed in such a way that the total communication paths are minimized. Also, servers could be placed in a Tree Network to minimize communication when broadcasting requests, such as scanning, to an entire LH*LH file. One interesting fact to take into account is the way the Parsytec routes messages. It routes them in a static way, always along the same route from one node to another, first horizontally and then vertically. An unfortunate allocation could place data clients in such a way that they interfere by using the same static routes for all communication. We leave this area open for research.

An SDDS has a fixed internal communication pattern, which stems from the splitting and forwarding strategy; in LH* this is a tree-like structure, as can be seen in Figure 1.1. There is, however, no special pattern for the clients accessing the data structure, and thus no natural static topology that could be used for the LH* algorithms. However, in the Parsytec environment, static communication links must be established to use the fastest means of communication. Therefore, when the program is started, we establish links from each node to each of the other nodes for the whole machine. Two links are in fact established: one for data messages and one for control messages. The latter is given higher priority in the program. The rationale for this is that, for example, a split message or a token should not be unnecessarily delayed on its way to the receiver, because this could badly influence operation. Mailbox communication,^2 which is semantically better suited for SDDSs, was found to be both slower and unreliable in the Parsytec environment.

Other communication needs arise when applying reduction, such as summing, and when implementing scanning. It is then natural to use some sort of "limited" broadcasting or multicasting. Multicasting sends the same message to a group of specified machines. It is favorable for initializing scanning operations. On a Parsytec this has to be implemented by point-to-point primitives. Simply implementing scanning by just iterating over a complete list of all known Data Servers is not feasible since, among other things, it would require the list and its size to be known in advance or updated during the messaging. An easy way to accomplish this is to use the hidden tree structure, discussed earlier, that is formed during the splitting of the LH* nodes: a splitting node is a parent node, and the new nodes can be seen as its children. In this way the problem is easily distributed, and the time complexity decreases from O(n) to O(log n), where n is the number of data servers. This also efficiently takes care of the problems with pre-split or merged servers, since the parent knows all about its children.

^2 Sometimes also referred to as store-and-forward.
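The tree-based distribution of a scan can be sketched as follows (assumed helper names, not the implementation): each server forwards the request to the servers it has itself created by splitting, so the scan fans out from server 0 without any global server list.

    #include <stdint.h>

    struct scan_request;                               /* opaque scan message  */
    extern int      my_children[64];                   /* servers this server  */
    extern unsigned my_child_count;                    /* created by splitting */
    extern void send_scan(int server, const struct scan_request *s);
    extern void scan_local_buckets(const struct scan_request *s);

    /* Propagate a scan along the tree formed by the splits: forward the
     * request to every server created by this one, then scan the local
     * LH-buckets. Started at server 0, this reaches all n servers in
     * O(log n) sequential message steps instead of O(n).                 */
    static void handle_scan(const struct scan_request *s)
    {
        for (unsigned c = 0; c < my_child_count; c++)
            send_scan(my_children[c], s);
        scan_local_buckets(s);
    }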

2.4 LH*LH Implementation

We will now go through the inner workings of the data server and the data client. This involves how messages are sent, addressed, received, and then dispatched to the actual data structure. We start with the initialization of the system, and then we describe the client: what it does, why, and how. Then we turn to the server, continuing the discussion. Communication choices are explained mainly in the Parsytec environment. We will briefly mention experience from an early implementation in the PVM environment [MSP93], too.

2.4.1 The System Initialization

The first thing done during the initialization of the system is that the nodes calculate their logical machine address. This is currently an integer number; in a larger environment this would require a mapping from this number to a more physical number, for example an IP address. In PVM a simple protocol is used to calculate this number in a negotiation process with the other nodes, whereas on the Parsytec the number is given by the operating system. This number is not directly used by the LH*LH data structure; instead there is a mapping, for each LH*LH file, from a logical data server number to a (machine) node logical number. This is referred to as the Server Mapping. How such mappings can be implemented is discussed in Section 2.4.4.

In the Parsytec environment, during this phase we also set up communication links between all nodes in all directions, to be able to use the fastest means of communication. There are actually two links set up in both directions: one for request and data communication, and one for control messages. Examples of control messages include forcing a split to a new server, and counting the servers. The reason for this is that some messages should have a higher priority, and there is no simple way to receive these on a link without accepting all messages and then having to store them. The control messages are always checked and executed first, and they should be sparse compared to other communication. This also allows us to receive request and data messages with entry flow control, implemented using a limited receive buffer. This means that the server node can hold a limited number of received requests in a queue and process them one by one. It not only allows smoother operation on the client side if a server at some point happens to receive more requests than usual, but it also prevents overloading a server.

Experience shows that, for example, the PVM communication package has no limit other than available memory on the number of messages it will receive without the receiver program code asking for any messages. This is not good for several reasons. First, memory overflow can easily occur at the receiver when clients are sending requests faster than the server can handle them. Second, the probability that clients at a node send messages to the wrong server increases, since the IAM will not be sent before the server has handled the request. In the PVM environment there are no guarantees that any communication blocks;^3 the PVM communication package may (and so it did) receive messages till it exhausts memory, even if the program at the receiving end never asks for any messages, i.e., never executes a receive operation; this also increased the number of IAMs since requests queued up. Both of these aspects are implemented in an interchangeable layer where the specific code for a machine and operating system environment resides.

It is currently assumed that all the data clients and data servers run on "equal" computers. By this we mean that they form a networked multi-computer, that is, a homogeneous environment as regards communication software and libraries. However, being able to work in a heterogeneous environment is just a matter of changing a well-defined layer in the software in such a way that it behaves compatibly.

^3 That is, that the send operation returns only when the data has been received at the other end, e.g., synchronous communication.

2.4.2 The Data Client

A data client is an execution thread. It accesses data stored on clusters of data servers in (LH*LH) files, and this can be extended to other distributed (scalable) data types as well. Several data clients can cooperate in fulfilling the same goal; for example, a search can be split by a program into several data clients that use the data client interface, and that then access, process, and collect the information and send it to the requester. There can be several data clients on one node, using different threads. The clients can then jointly benefit from the same Image Adjust Messages that other data clients at the same node received.

The data client uses the Data Client Interface, with which it can access any LH*LH file stored in the network computer. A file is identified by a TableID and a ServerList; the latter contains the Server Mapping mentioned earlier, and the former is a unique integer number that identifies the distributed file. When a file is opened, a pointer to a handle structure is returned. This handle identifies the file and stores the current client image state of that file. Thus, if the same handle for the file is used by more than one client thread, the updates will benefit all of them. This handle is then used in all calls to the data client interface.

Function Outline

When a client performs an insert operation on a file, the following operations occur internally in the data client interface. First, the pseudo-key is calculated using a hashing function. Using the pseudo-key and the client image for that file, which is stored in the data accessible through the file handle, the client calculates the logical number of what it believes is the appropriate data server to store the data. This number is then mapped to the machine node number. A message is then assembled containing the identity number of the file, the key, the data as a blob of bytes, its associated size in number of bytes, and the client image.^4 This message is then sent directly to the calculated destination server.

In the Parsytec environment, the synchronous communication and our entry flow control ensure that the message will not be transferred before the server actually issues a receive. After the message has been received by the calculated destination server, the data client code returns directly to the caller if it was an insert operation. Otherwise, if the operation requires an answer from a server, it blocks and awaits the message, returning when the answer is received. The client can then directly issue a new operation on the distributed file. Several client threads can individually communicate with the file without disturbing each other's operations.

^4 Actually, in our case only the level of the LH*LH file is sent.
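The insert path just described can be sketched as follows (assumed names; the real interface, message layout, and hash function are not reproduced):

    #include <stddef.h>
    #include <stdint.h>

    struct handle {                    /* per-file client state (sketch)   */
        int        table_id;           /* unique file identity             */
        unsigned   image_level;        /* client image: level i'           */
        uint32_t   image_split;        /* client image: split pointer n'   */
        const int *server_map;         /* logical server -> node number    */
    };

    extern uint32_t hash_pseudo_key(const char *key);           /* fixed hash */
    extern void send_insert(int node, int table_id, const char *key,
                            const void *blob, size_t blob_size,
                            unsigned image_level);               /* transport  */

    /* Client-side insert: compute the address from the client image (the
     * LH scheme of Section 1.6), map it to a node, assemble and send. The
     * call returns as soon as the server has accepted the message.        */
    static void client_insert(const struct handle *h, const char *key,
                              const void *blob, size_t blob_size)
    {
        uint32_t pk = hash_pseudo_key(key);
        uint32_t a  = pk & ((1u << h->image_level) - 1u);
        if (a < h->image_split)
            a = pk & ((1u << (h->image_level + 1)) - 1u);
        send_insert(h->server_map[a], h->table_id, key, blob, blob_size,
                    h->image_level);   /* only the level travels (cf. ^4) */
    }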

Image Adjust Messages

The client image variables should be adjusted when the client makes an addressing error. It is the responsibility of the LH*LH file servers to send a message that corrects the client. In LH* this message is sent by the final receiving server when it identifies that the message has been forwarded to it. However, in LH*LH we have chosen to let a server that forwards the request also send the update message to the client. There are some advantages and disadvantages to this approach. The number of image adjust messages (IAMs) will increase for new clients, since forwarding can occur a couple of times and each forwarding will yield a new IAM. On the other hand, since each forwarding takes a while, the client will get the message back and can adjust its image before the final recipient server has processed the request; thus the next request of a highly active node will use a more appropriate destination. This will not only reduce the delay before the client gets a more updated image, it can also increase the throughput of the clients on the same node when more servers are employed, and thus the whole distributed file will benefit. The individual servers then become less loaded and forwardings can be avoided. A possibly more serious drawback is that the client has a better image of the file before its previous request has reached its final destination.^5 However, the same thing can occur even if the IAM is sent only when the request has reached its final destination.^6 To avoid this problem, the servers could be required to always send back an acknowledgement message when the operation has been completely finished, and the clients would then be required to wait for it. This is, however, an inefficient solution that limits the throughput of the clients, and still the same problem can occur between different clients performing joint work.

In LH* it is proposed to send the IAM piggy-backed on the reply message. This is not done in LH*LH, where we do not use reply messages when there is no data to be returned.^7 The IAM message is sent using asynchronous communication primitives on the Parsytec machine. This solves some problems, but can also raise others, as discussed below. The encoding of an IAM also includes the file identity number. This enables the data to be directed to the appropriate client file image.

Waiting for a message

When using the communication links on the Parsytec, there is no way, when receiving or waiting for messages, to choose among messages or probe messages before they have been received. The link is a communication channel that works like a pipe, in that messages can be sent from one originator only to the other end, the destination. A client accessing data on an SDDS will by necessity have to expect answers from any^8 sender. The result is that it has to listen to all links and receive all messages to actually get the message it is waiting for. To avoid this, and to be able to code a message so that it can be selected, one has either to explicitly code a two-way communication protocol that negotiates by sending extra information, or to set up a communication link especially for this type of message. The latter was our implementation choice. Each link requires a certain amount of buffer space and will thus be associated with some cost. The total number of links to be established will then be linear with respect to the number of different message types and quadratic in the number of nodes. One alternative on the Parsytec is to use the mailbox communication primitives. They are, however, not guaranteed to be delivered, limited in size, and slow. But they are an alternative when the number of messages is relatively small, their sizes limited, and when the loss of one individual message will not cause any real harm. This is the case for the IAMs.

^5 For example, a client inserts some data. This yields an IAM that is sent back to the client. The client sends a request for the same data, but to a more correct address; we are now not guaranteed that the data has been inserted at this address. It could have been delayed, and still be waiting to be forwarded.
^6 The actual scenario is similar to the previous one, but involves another request. This request is sent before the search, and triggers an IAM. If the previous insert still has not reached its destination, the search may produce an unexpected result.
^7 Remember that the link communication on the Parsytec is safe.
^8 At least from any server.

Suggested Improvements

When a file is opened, the handle returned should be a handle to the already existing structure, if the file has already been opened. This is, however, just a matter of programming style, since the variables containing the handles can be shared or copied.

A key is currently limited to a null-terminated character string. The hash function calculating the pseudo-key is fixed. The data type of the key and the hashing function to use should be definable per file.

For efficiency one would like to send the already calculated pseudo-key in any request carrying a key. But we chose not to do so, hoping to keep the interface more general. This makes the interface independent of the actual data distribution algorithm. A counter-example is an RP* file, which has no "pseudo-key".

Data flow control is hardware/software dependent. On the Parsytec it is enforced by the synchronous link messaging primitives, which probably have a two-way negotiation communication phase, since the message is not received before the receiver asks for it. Generally, in environments such as TCP/IP (sockets or using PVM) we would have to do this ourselves.

The information in the client image that is sent in a request can, as noted, be reduced in some cases. In the LH* algorithm only the level is transferred by the client, and in an image adjust message (IAM) the logical number of the file server is also included. From this information the state, as far as the server knows it,^9 can be reconstructed by the client.

An alternative to using the asynchronous communication on the Parsytec machines is to use the control message link. This should be relatively safe, since the number of IAMs will be limited. This, however, requires the receiver code to know where to store the information from the IAMs, and it has to keep track of all file handles and their associated data.

^9 The server in LH* does not know the actual state of the file. It can only assume that logical servers with a smaller number are of at least the same LH* level as itself, and that servers with a higher number are of a level no higher than itself, presplitting not taken into account.

2.4.3 The Server

The data servers store the distributed files. A subset of the data servers cooperate in storing one file. That is, there is one subset of nodes for each file, and a specific ordering of them, called a server mapping, maps local logical server numbers to physical numbers.

Function Overview

The data server works on messages initiated by a data client. Messages in our Parsytec implementation can be received on either of two links. When a message has been received, an event handler extracts the type of the message. This type, which is represented as a number, is then used to look up in a table which function is to be called; the given function is then called. Let us for example assume that we received a Find-typed message. Then the receiver message-handler function is called with the message buffer as its argument. From this buffer it extracts the key, the data and its associated length, and the client image. First, the handle of the bucket stored at this server is retrieved. If such a handle is not found, there is an error in the client addressing mechanism.^10 Then a check is made as to whether the key hashes to this or another server node. This checks the client's image, and if this is not the correct server, it will forward the message and send an IAM to the client that requested the operation; then this request is no longer of concern to the server that received it. Eventually, the message will reach the appropriate data server node. When it has been established that the request concerns the data server that received it, processing continues there.

^10 It might also occur if a distributed file has shrunk and thus the server no longer stores a part of that file. In this case, forwarding information should have been left behind at the node, to be used to direct the clients and send them a "special", sort of reverse, IAM.
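The address check described above can be sketched as follows (hypothetical names; error handling and the split-time cases of the next subsection are omitted):

    #include <stdint.h>

    struct request;                                      /* opaque message   */
    extern uint32_t  req_pseudo_key(const struct request *r);
    extern int       req_client(const struct request *r);
    extern void      forward_to_server(uint32_t logical_server,
                                       const struct request *r);
    extern void      send_iam(int client, unsigned level, uint32_t my_server);
    extern void      process_request(const struct request *r);

    /* Server-side check: recompute the address with this server's own
     * level; if the request does not hash here, forward it and send an
     * IAM back to the requesting client (in LH*LH the forwarder sends
     * the IAM).                                                          */
    static void handle_request(const struct request *r,
                               uint32_t my_server, unsigned my_level)
    {
        uint32_t a = req_pseudo_key(r) & ((1u << my_level) - 1u);
        if (a != my_server) {
            forward_to_server(a, r);
            send_iam(req_client(r), my_level, my_server);
            return;                     /* no longer this server's concern */
        }
        process_request(r);             /* the request belongs here        */
    }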

LH*LH Workings

Before data would normally be inserted in LH*LH, the server first has to check that it is not currently splitting the file that the request is accessing. If that file is being split, the request might have to be sent, forwarded, to the new server. If so, it must be done without updating the client, since the official image of the file does not yet include the splitting destination. In a more general perspective this goes well in hand with the discussion of the presplitting of servers presented in [LNS93]. If the file undergoes a split, the key is then checked for whether it would need access to the part of the file that has already been moved. If this is the case, the request is forwarded without adjusting the client's image; nor should the receiving server send an IAM to the client.^11 Otherwise, if the data concerns LH-buckets that have not yet been sent, then the data can be inserted locally into the storage structure.

The tricky part here involves requests that concern data in the LH-bucket currently being moved. If key uniqueness applies, i.e., only one object can be stored for each key, some operations must be handled with care. If an object can be found locally using the key, we overwrite as usual; otherwise we insert it. When this object is shipped to the new server, the same action will take place.^12 Look-ups, however, must first look in the local data; if a matching key is found we return it, otherwise the request is forwarded to the new server. Deletions are special. We have to delete locally and then forward the request, since we do not have any knowledge about the existence of the data at the new server. If several objects can be stored with the same key, i.e., it is used as a secondary index, inserts can be made locally; they will be moved later in any case. Look-ups, on the other hand, have to be performed first locally and, independently of the result, the request is forwarded to the new server and executed there as well. Here another complication arises: both the splitting server and the new server might have matching data. Then both have to return their answers to the requesting client, which puts higher demands on the implementation of the client, which has to be able to receive partial answers from several servers! A more simplistic alternative is to delay the look-up operation until the whole LH-bucket has been transferred, and thereafter forward the request.^13 Deletions will also have to be executed at both places. When data is shipped in the non-unique key variant, it may not overwrite any of the already existing data that matches the key.

It is very important that, when forwarding to the new server occurs, no IAM is sent, and that when the reply returns to the client, the client does not update its image, even if the client could deduce from the responding server's number that there is (at least) one more server that it does not know about. Our clients only update themselves using information from IAMs. The new server is not allowed to handle any requests from any client regarding the file that is being expanded onto that server. This would bypass the LH*LH concurrency algorithm.

Another important aspect during splitting is the order of the operations and forwardings that we perform. We have to ensure that there is a linearity in time between the forwarding and the shipping of data. This means that we either have to assume that messages are handled at the new server in the order that they arrive, and that this is the same order as they were sent from the splitting server,^14 or we do this linearization ourselves. We take the latter approach. There is a special thread that does the splitting; this thread processes any waiting requests to the data server in between the shipping operations. Thus messages will be serialized, and sent over the same links. This guarantees that the order of the messages we send will be preserved at the new server. This solution removes any concurrency problems, but relies on a local queue of requests being built instead of executing the requests directly.

^11 It will automatically be updated on the next addressing fault, when the official image of the file has changed.
^12 Actually, data identified by a key could be overwritten several times during the splitting. Either several clients send their requests close in time, or some client sends a stream of updates. Then the old data will be overwritten first locally; then, when it has been moved, the second insert will not overwrite any data, but it will overwrite later when it is moved. Which of these overwriting inserts survives in the end depends on the order in which they arrive at the splitting server.
^13 All requests should then be queued and forwarded in the correct order. The success of this method depends heavily on the time it takes to transfer one LH-bucket.
^14 This means that there has to be full control over, or full guarantees on, the order in which messages are assembled, buffered, stored, sent, transferred, received, and stored.

Suggested Improvements

When a server receives a request, a check is made as to whether or not it has reached its destination, according to the server itself. If so, the LH* algorithm normally does not assume this to be an addressing problem and therefore does not send any IAM. However, from the information given by the client image level it might be deduced that the client needs updating, and the server could then issue an IAM even if there was no addressing error. This is not part of the original LH* algorithm, but it fits quite well with the LH*LH algorithm and its updating and forwarding mechanism.

When a request is answered by a new server that is not yet officially part of the distributed file, no IAMs are allowed to be sent that could let the client know about this new server. One way to implement this is by masquerading the answer, which means making the server fake its node number to be the same as that of the server that is splitting. Another solution is to ensure that the clients do not use this information. In either case, no client should be able to send requests to the new server before the shipping is finished at the splitting server. A third, and probably the most secure, implementation would let the server that is splitting act as a proxy for the client and send the request and return the data to the client. This would involve one step of unnecessary communication.

2.4.4 Server Mapping

The Server List contains the means for mapping the logical server node number of the SDDS, in this case that of the LH*LH file, to a physical machine number. The LH* algorithm does not concern itself with this. Some ideas are, however, presented in [LNS96]. In LH*LH we currently use a static table, known by all nodes in the multi-computer. But the management and distribution of this table itself is a problem potentially of high proportions. Each client accessing the SDDS and each server storing the SDDS has to be able to do this mapping. One easy solution is to state that this mapping table will not grow too large and can therefore be stored at each participating server. When a client makes an addressing error, the missing information could be added to the IAM (Image Adjust Message), which then updates the client. This avoids the problem of having just a central node storing and delivering the map on request, but it would be a non-scalable solution since the actual allocation of nodes does not scale. This particularly concerns cases where the clients are many and possibly short-lived.

Changes in the mapping will either have to be distributed to the affected clients and servers, or, using an SDDS-like schema, be updated by the (old) server receiving the faulty request. One idea, untested though, is to add another image variable that somehow is a "unique" id of the mapping the client has (a simple solution would be a checksum of the mapping data); if a server detects that this "image" is out of date, it then adjusts it with an IAM, adding the new mapping. If the client accesses a server that no longer has any knowledge of the accessed SDDS file, it can then use a fall-back scheme that shrinks the image level by level, till, in the end, only the logical server 0 is contacted. If this fails, the client has to "reopen" the file.^15 Reopening the file might have to use a directory service for finding the SDDS file from its ID.

^15 Accessing an SDDS file requires knowledge of at least the first server's physical address.

Autonomouss "randomized" mapping Anotherr idea is to not have the server list fully materialized, but instead usee a scheme where the mapping is calculated in a stable and reproducible way.. This can be done both at the clients and server totally independently. If theree are a limited number of machines that can be used and a large amount off files to be stored, a randomizing schema could be used for allocation of neww server nodes to the file. This might also provide benefits of randomized loadd balancing, assuming random creation and growth of the stored files. Thus,, the server list can be seen as a data structure with operations thatt can reproduce the actual mapping. This is especially needed in a large distributedd autonomous environment with many nodes. An example of such ann algorithm would be a randomizing function with a seed specific for the distributedd file (the id, for example), and with a list of servers participating thatt are willing to participate in the distribution. It is known that most randomizingg functions are not really random. With the same seed the same sequencee of numbers can be reproduced; these; numbers (usually between

155 Accessing a SDDS file requires knowledge at least of the first server's physical address. 54 4 CHAPTERCHAPTER 2. THE LH*LH ALGORITHM

00 and 1) can then be used to remove identities of willing (unused) nodes fromm the list of all nodes. All clients (and servers) using the same function, seedd and so on. will then yield the same sequence. The possibly unwanted propertyy of this method is that this list of "willing" servers has to be main• tainedd or allocated. However, the algorithm does not require a total static list,, it can be allowed to grow in chunks and communication can then be minimized. .
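A sketch of such a reproducible mapping (illustrative only; the generator, the seed handling, and the willing-nodes list are assumptions): seeding a deterministic pseudo-random generator with the file identity lets every client and server derive the same node picks from the same list of willing nodes.

    #include <stdint.h>

    /* Deterministic pseudo-random generator (any fixed algorithm will do,
     * as long as every participant uses the same one).                    */
    static uint32_t next_rand(uint32_t *state)
    {
        *state = *state * 1664525u + 1013904223u;   /* LCG, reproducible */
        return *state;
    }

    /* Fill server_map[0..count-1] with node numbers drawn, without
     * repetition, from the list of willing nodes, using the file id as
     * the seed. Clients and servers sharing the seed and the willing-node
     * list compute exactly the same mapping, so no table is exchanged.    */
    static void derive_server_map(uint32_t file_id,
                                  int *willing, unsigned num_willing,
                                  int *server_map, unsigned count)
    {
        uint32_t state = file_id;                    /* file-specific seed */
        for (unsigned s = 0; s < count && num_willing > 0; s++) {
            unsigned pick = next_rand(&state) % num_willing;
            server_map[s] = willing[pick];
            willing[pick] = willing[--num_willing];  /* remove chosen node */
        }
    }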

DNS-like mapping (Internet)

Another, more appealing method for Internet distribution is to use the Internet DNS databases, which provide a mapping from a logical name to a physical IP address. For example, we could register a domain lhlh.ida.liu.se under which we let each LH*LH file be identified by, for example, a unique name, let us say film. We then let the first node (logical node number zero) in this LH*LH file be 0.film.lhlh.ida.liu.se, the second 1.film.lhlh.ida.liu.se, and so on. This provides physical independence, and it also scales since the DNS servers employ caching. However, the drawback with this schema is that when the nodes move, it may take a rather long time before this information reaches out through the distributed database.^16

^16 It may take from a few hours to some days, depending on the software used.

2.4.5 Summary and Future Work

Summary

So far, we started out describing databases and future applications that will use database technology. We identified the need for high performance, scalability, and high availability, and this led up to parallel data servers. Dynamic (re-)scalability is achieved by using a new type of data structure, the Scalable Distributed Data Structures (SDDSs). We have described a real implementation of an SDDS, the LH*LH. The behaviour of LH*LH was studied and performance measures were given. It was shown that local bucket management is an important issue for the implementation of high-performance SDDSs. Several thoughts about the implementation, and details of the problems involved, were discussed. Also, potential problem areas have been identified.

Issues Not Covered

We have not covered any deeper performance analysis of LH*LH under various conditions. The use of SDDSs in a transactional environment, with several interfering autonomous clients working, is a particularly interesting area. A formal performance model is also needed; in general such models are not yet available for SDDSs. The task seems of even greater complexity than for more traditional data structures. Also, if the algorithm is to be used for more than one file, a different physical mapping (e.g., randomization) to the nodes should be used for each file to distribute the load. Several different solutions could apply here, but this is an open area. The handling of the list of participating servers for an SDDS file also needs to be scalable, and this needs further investigation. The ideas put into the LH*LH design should also apply to other known SDDSs; they should allow for corresponding variants for switched multi-computers. One benefit would be scalable high-performance ordered files. The SDDSs in [LNS94] or [KW94] would seem to be a promising basis on which to aim for this goal.

Subsequent Issues

Most SDDSs were created and motivated on their own, their usefulness easily understood from the mere design concept. However, by integrating SDDSs as a component of a DBMS, one may expect important performance gains. This would open new application perspectives for DBMSs. Video servers seem to be a favorable axis, as it is well known that major DBMS manufacturers are already investigating switched multi-computers for this purpose. The complex real-time switching data management in telephone networks [Ron98] seems to be another domain of interest.

Having a rich set of modeling capabilities in newer database systems, such as object-orientation, gives rise to other questions. Are objects and their attributes to be stored differently depending on their position in the object-type hierarchies? How do we store which objects in an SDDS file? How does this relate to object-oriented querying? To user-defined predicates? As can be seen, many interesting issues will arise.

2.4.6 Host for Scientific Data

To approach these goals, we plan to make use of the implementation of LH*LH for high-performance databases. Interfacing an SDDS with a main-memory research database system, such as AMOS [FRS93] or MONET [BK95], potentially allows for much higher processing speeds than an individual non-distributed system would achieve. For easy integration it is important that the database systems are extensible at many levels. An idea proposed in [Kar97] is to use an ordinary AMOS on a single workstation, where some data types, relations or functions would be stored and searched by the MIMD machine. AMOS can then act as a front-end system to the parallel stored data. The query optimization of AMOS then needs to be extended to also take into account the communication time and the possible speed-up gained by using distributed parallel processing. Monet [BK95] is, on the other hand, optimized for bulk processing. In the context of SDDSs this would mean that individual and/or streamed requests would be of less importance, favoring operations on larger chunks. [KK97] introduces the idea of Live Optimization, which allows for on-the-fly optimization by local operators.

SDDSs other than LH* are also of interest for evaluation, one candidate being RP* [LNS94], which handles ordered data sets.

2.5 Hardware Architecture

Figure 2.5: One node on the Parsytec machine.

The Parsytec GC/PowerPlus architecture (Figure 2.5) is massively parallel with distributed memory, also known as MIMD (Multiple Instruction Multiple Data). The machine used for the LH*LH implementation has 128 PowerPC-601 RISC processors, constituting 64 nodes. One node is shown in Figure 2.5a. Each node has 32 MB of memory, shared between two PowerPC processors and four T805 Transputer processors. The latter are used for communication.


2.5.1 Communication

The communication is point-to-point, and the software libraries support both synchronous and asynchronous communication [Par94]. Connections can be established using links, optionally on a virtual topology. Mailbox communication is also available, which is the most general form of communication, but it is limited due to the store-and-forward principle used. Broadcasts and global exchange are implemented using the forms of communication presented above.

The number of virtual connections is not limited by the number of physical connections. The virtual connections can be used to implement a hand-crafted topology of the processors. There are also predefined libraries for the most common topologies, such as pipe, 2-dimensional and 3-dimensional mesh, 2-torus, 3-torus, tree, and hypercube.

The response time of a communication depends on the actual machine topology. The closer the communicating nodes are, the faster the response. Routing is done statically by the hardware, as in Figure 2.6b, with the packages first routed in the horizontal direction.

Figure 2.7: Allocation of servers and clients.

2.5.2 Measure Suite

We performed the tests using 32 nodes since, at the time when the tests were performed, only 32 nodes were available at our Parsytec site. The clients were allocated downwards from node 31, while servers were allocated from node 0 and upwards, as shown in Figure 2.7. The clients read the test data (a random list of words) into main memory in advance to avoid the I/O disturbing the measurements. Then the clients started inserting their data, creating the test LH*LH file. When a client sent a request to the server, it continued with the next item only when the request had been accepted by the server (rendezvous). Each time, just before the LH*LH file was split, measures were collected by the splitting server. Some measurements were also collected at some clients, especially timing values for each of that client's requests.

2.6 Performance Evaluation

The access performance of our implementation was studied experimentally. The measurements below show the elapsed times of various operations and the scalability of the operations. Each experiment consists of a series of inserts creating an LH* file. The number of clients, the file parameters M and m, and the size of the objects are LH*LH parameters.

2.6.1 Scalability

Figure 2.8: Build time of the file for a varying number of clients.

Figure 2.8 plots the elapsed time taken to constitute the test LH*LH file through n inserts; n = 1, 2, ..., N and N = 235,000, performed simultaneously by k clients, k = 1, 2, ..., 8. This time is called the build time and is denoted Tb(n), or Tb^k(N) with k as a parameter. In Figure 2.8, Tb(N) is measured in seconds. Each point in a curve corresponds to a split. The splits were performed using the concurrent splitting with the dynamic control and the bulk shipping. The upper curve is Tb^1(n). The next lower curve is Tb^2(n), and so on until Tb^8(n). The curves show that each Tb^k(n) scales up about linearly with the file size n. This is close to the ideal result. Also, using more clients to build the file uniformly decreases Tb^k, i.e.,

k'k' >k"- >Tbk'{n)

ideall scale-up should reach k times, i.e., the build time Tb8(N) - 40 sec only.. The difference results from various communication and processing de• layss at a server shared by several clients, as discussed in previous sections andd in what follows.


Figure 2.9: Global insert time measured at one client, varying the number of clients.

Figure 2.9 plots the curves of the global insert time

    Ti_k(n) = Tb_k(n) / n   [msec].

Ti measures the average time of an insert from the perspective of the application building the file on the multi-computer. The internal mechanics of the LH*LH file are transparent at this level, including the distribution of the inserts among the k clients and several servers, the corresponding parallelism of some inserts, the splits, and so on. The values of n, N and k are those shown in Figure 2.8. Increasing k improves Ti in the same way as for Tb. The curves are also about as linear, in fact constant, as they should be. Interestingly, and perhaps unexpectedly, each Ti_k(n) even decreases when n grows, the gradient increasing with k. One reason is the increasing number of servers of a growing file, leading to fewer requests per server. Also, our server and client node allocation schema decreases the mean distance through the net between the servers and the clients of the file.

The overall result is that Ti is always under 1.6 msec. Increasing k uniformly decreases Ti, until Ti_8(n) < 0.8 msec for all n, and Ti_8(N) < 0.4 msec in the end. These values are about ten to twenty times smaller than access times to a disk file, typically over 10 msec per insert or search. They are likely to remain forever beyond the reach of any storage on a mechanical device. On the other hand, a faster net and a more efficient communication subsystem than the one used should allow for even much smaller Ti's, in the order of dozens of microseconds [LNS96] [LNS94].


Figure 2.10: Actual throughput with varying number of clients.

Figure 2.10 plots the global throughput T_k(n), defined as

    T_k(n) = 1 / Ti_k(n)   [inserts/sec].

The curves again express an almost linear scalability with n. For the reasons discussed above, T_k even increases for larger files, up to 2700 inserts/sec. An increase of k also uniformly increases T for every n. To see the throughput scalability more clearly, Figure 2.11 plots the relative throughput

    Tr(k) = T_k(n) / T_1(n),

for a large n, n = N. One compares Tr to the plot of the ideal scale-up, which is simply T'r(k) = k. The communication and service delays we spoke about clearly play an increasing role when k increases. Although Tr monotonically increases with k, it diverges more and more from T'r. For k = 8, Tr = 4, which is only half of the ideal scale-up. It means that the actual throughput per client,

    Tc_k(n) = T_k(n) / k,

also comparatively decreases, to half of the throughput T_1 of a single client.

Figure 2.11: Ideal and actual throughput with respect to the number of clients.

Figure 2.12 and Figure 2.13 show the comparative study of the dynamic and the static split control strategies. The plots show build times, with Tb'(n) for the static control and Tb(n) for the dynamic one. The curves correspond to the constitution of our example file, with k = 1 in Figure 2.12 and k = 4 in Figure 2.13. The plots of Tb are the same as in Figure 2.8. Figure 2.12 and Figure 2.13 clearly justify our choice of the dynamic control strategy. Static control uniformly leads to a longer build time, i.e., for every n and k one has Tb'(n) > Tb(n). The relative difference (Tb' - Tb)/Tb reaches 30% for k = 1, e.g. Tb'(N) = 440 and Tb(N) = 340. For k = 4 the dynamic strategy more than halves the build time, namely from 230 to 100 sec.

Note that the dynamic strategy also generates splits more uniformly over the inserts, particularly for k = 1. The static strategy leads to short periods when a few inserts generate splits of about every bucket. This creates a heavier load on the communication system and increases the insert and search times during this period.


Figure 2.12: Comparison between Static and Dynamic splitting strategy, one client.


Figure 2.13: Comparison between Static and Dynamic splitting, with four clients.

2.6.2 Efficiency of Concurrent Splitting


Figure 2.14: Efficiency of individual shipping.

Figure 2.14 shows the study of the comparative efficiency of individual and bulk shipping for LH* atomic splitting (non-concurrent), as described earlier. The curves plot the insert time Ti_1(t) measured at t seconds during the constitution of the test file by a single client. A bulk message contains at most all the records constituting an LH-bucket to ship. In this experiment there are 14 records per bulk on average. A peak corresponds to a split in progress, when an insert gets blocked until the split ends.

The average insert time beyond the peaks is 1.3 msec. The corresponding Ti's are barely visible at the bottom of the plots. The individual shipping, shown in Figure 2.14, leads to a peak of Ti = 7.3 sec. The bulk shipping plot, Figure 2.15, shows a highest peak of Ti = 0.52 sec, i.e., 14 times smaller. The overall build time Tb(N) also decreases by about 1/3, from 450 sec (Figure 2.14) to 320 sec (Figure 2.15). The figures clearly prove the utility of the bulk shipping.

Observe that the maximal peak size was reduced according to the bulk size. This means that larger bulks improve the access performance. However, such bulks also require more storage for themselves as well as for the intermediate communication buffers, and more CPU for the bulk assembly and disassembly. To choose the best bulk size in practice, one has to weigh all these factors depending on the application and the hardware used.
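As an illustration only, the bulk-shipping idea amounts to packing the records of the bucket to ship into one outgoing buffer instead of sending them one by one. The following C sketch assumes a simple record list and a hypothetical send_message() primitive; it is not the Parsytec implementation, and the buffer size is an arbitrary choice.

    #include <stddef.h>
    #include <string.h>

    #define BULK_CAPACITY 4096        /* assumed size of the shipping buffer */

    struct record { struct record *next; size_t len; char data[]; };

    /* Hypothetical point-to-point send, standing in for the library call. */
    extern void send_message(int dest_server, const char *buf, size_t len);

    /* Ship a bucket (linked list of records) using bulk messages,
       flushing the buffer whenever the next record would not fit.
       Records are assumed to be smaller than the bulk buffer. */
    void ship_bucket_bulk(int dest_server, struct record *bucket)
    {
        char bulk[BULK_CAPACITY];
        size_t used = 0;

        for (struct record *r = bucket; r != NULL; r = r->next) {
            if (used + r->len > sizeof bulk) {   /* flush a full bulk */
                send_message(dest_server, bulk, used);
                used = 0;
            }
            memcpy(bulk + used, r->data, r->len);
            used += r->len;
        }
        if (used > 0)                            /* ship the remainder */
            send_message(dest_server, bulk, used);
    }

With individual shipping the loop body would instead call send_message() once per record, which is exactly the overhead the measurements above quantify.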


Figure 2.15: Efficiency of bulk shipping.

However, currently no bulk message contains more than one LH-bucket, although the LH*LH algorithm could be extended to ship more buckets^17. A more attractive limit is the size of the buffer to be used, rather than the number of records.

Figure 2.16 shows the results of the study where the bulk shipping from Figure 2.15 is finally combined with the concurrent splitting. Each plot Ti(t) shows the evolution of the insert time at one selected client among k clients, k = 1..4, 8, concurrently building the example file with the same insert rate per client. The peaks in the figures again correspond to the splits in progress, but they are much lower. For k = 1 they are under 7 msec, and for k = 8 they reach 25 msec. The worst insert time with respect to Figure 2.15 thus improves by a factor of 70 for k = 1 and of 20 for k = 4. This result clearly justifies the utility of the concurrent splitting and our overall design of the splitting algorithm of LH*LH.

The plots in Figures 2.16a to 2.16e show a tendency towards higher peaks of Ti, as well as towards a higher global average and variance of Ti over Ti(t), when more clients build the file. The plot in Figure 2.16f confirms this tendency for the average and the variance.

^17 This might not be so attractive, since the algorithm gets slightly more complicated and the bulk messages too big.

(a) One active client   (b) Two active clients   (c) Three active clients   (d) Four active clients   (e) Eight active clients   (f) Average, std. deviation

Figure 2.16: Efficiency of the concurrent splitting.

(a) One active client   (b) Two active clients   (c) Three active clients   (d) Four active clients   (e) Eight active clients

Figure 2.17: LH*LH client insert time scalability.

Figures 2.16d and 2.16e also show that the insert times become especially affected when the file is still small, as one can see for t < 10 in these figures. All these phenomena are due to more clients per server for a larger k. A client then has to wait longer for the service. A greater k is nevertheless advantageous for the global throughput, as was shown earlier.

Figure 2.16 hardly illustrates the tendency of the insert time when the file scales up, as the non-peak values are buried in the black areas. Figure 2.17 therefore plots the evolution of the corresponding marginal client insert time Tm_k. Tm_k is computed as an average over a sliding window of 500 inserts. The averaging smoothes the variability of successive values that gives the black areas in Figure 2.16. The plots of Tm_k(t) show that the insert times not only do not deteriorate when the file grows, but even improve. Tm_1 decreases from 1.65 msec to under 1.2 msec, and Tm_8 from 8 msec to 1.5 msec. This satisfactory behavior is again due to the increase in the number of servers and to the decreasing distance between the clients and the servers.

The plots also show that Tm_k(t) uniformly increases with k, i.e.,

    k'' > k'  =>  Tm_k''(t) > Tm_k'(t),

for every t. This phenomenon is due to the increased load of each server. Another point of interest is that the shape of Tm_k becomes stepwise for greater k's, with insert times about halving at each new step. A step corresponds to a split token trip at some level i. The drop occurs when the last bucket of level i splits and the split token comes back to bucket 0. This tendency seems to show that the serialization of inserts contributing most to a Tm_k value occurs mainly at the buckets that are not yet split.

The overall conclusion from the behaviour illustrated in Figure 2.17 is that the insert time at a client of a file equally shared among k clients is basically either always under 2 msec, for k = 1, or tends to fall under this time as the file grows. Again, this performance shows the excellent scale-up behaviour of LH*LH. The performance is in particular greatly superior to that of a typical disk file used in a similar way. For k = 8 clients, for example, the speed-up factor could reach 40 times, i.e., 2 msec versus 8 * 10 msec.
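The marginal insert time Tm_k is nothing more than a moving average of the raw per-insert timings; a minimal sketch of that smoothing step, assuming the raw timings are available in an array (the window size of 500 matches the text, everything else is our own), is shown below.

    #include <stddef.h>

    #define WINDOW 500   /* sliding window of 500 inserts, as in the text */

    /* out[i] becomes the average of the last WINDOW samples ending at i
       (shorter prefix windows are used at the start of the series). */
    void sliding_average(const double *ti_msec, double *out, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            sum += ti_msec[i];
            if (i >= WINDOW)
                sum -= ti_msec[i - WINDOW];
            out[i] = sum / (double)(i < WINDOW ? i + 1 : WINDOW);
        }
    }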

2.7 Curiosity

To test the performance of LH*LH on the Parsytec machine, we made a large number of batch runs, in which numerous different data sets were collected. Using advanced scripts, we automatically filtered the data, calculated different scale-up curves, and finally plotted them to files. Many of the plots contain so many measure points that gif files had to be used instead of postscript drawings, which decreased the size of such a plot from around 1 MB to 30 KB.

During the process of analyzing the data, the AMOS system [FRS93] was used experimentally. Data was imported, and the AMOSQL [KLR+94] query language could be used to construct plots on-the-fly. As an example of such a query, we plot the throughput curve from 1 server to 8 servers:

    create function Speed(Measure m) -> real as
        select operations(m)/Elapsed(m);

    plot((select servers, maxspeed
          for each integer servers, real maxspeed
          where maxspeed = maxagg(Speed(FSer(servers,
                                         FQM("W",
                                         FPC(32,
                                         Measures())))))
            and servers = iota(1, 8)),
         "X-Servers, Y-MaxSpeed, Write, PC=32");

First, we declare a derived function Speed(measure) that takes a measure point and calculates the number of operations per second, operations(measure)/Elapsed(measure). Then we call plot(bag of tuples, header string). The actual plotting is done by automatically calling the gnuplot program. The bag of tuples is constructed through an AMOSQL query, where servers is the number of servers, plotted along the x-axis, and maxspeed is the maximum Speed for that number of servers writing (FQM("W", ...)), using 32 nodes (FPC(32, ...)).

2.8 Conclusion

Switched multi-computers such as the Parsytec GC/PowerPlus are powerful tools for high-performance applications. LH*LH was shown to be an efficient new data structure for such multi-computers. Performance analysis showed that access times are in general of the order of a millisecond, reaching 0.4 msec per insert in our experiments, and that the throughput may reach thousands of operations per second, over 2700 in our study, regardless of the file scale-up. An LH*LH file can scale up over as much distributed RAM as is available, e.g., 2 Gbytes^18 on the Parsytec, without any access performance deterioration. The access times are in particular an order of magnitude faster than one could attain using disk files.

^18 One can note that this was on a Parsytec machine from 1995. Now (1999) we have an Origin 2000 with 16 GB RAM and 8 processors, soon to be upgraded to 48 GB.

Performance analysis also confirmed various design choices made for LH*LH. In particular, the use of LH for the bucket management, as well as of the concurrent splitting with the dynamic split control and the bulk shipping, effectively reduced the peaks of the response time. The improvement reached a thousand times in our experiments, from over 7 sec, which would characterize LH*, to under 7 msec for LH*LH. Without this reduction, LH*LH would likely be inadequate for many high-performance applications.

Chapter 3

hQT*: A Scalable Distributed Data Structure for High-Performance Spatial Access

This chapter is essentially the paper published at the 5th International Conference on Foundations of Data Organization (FODO'98), Kobe, Japan, November 1998.

Seeger and Guido Moerkotte proposed that I make a 2-dimensional hashing SDDS, using a similar technique as for multidimensional "order-preserving" hashing structures. Even though I first finished my Licentiate thesis, moved to Amsterdam, Holland, and got somewhat involved in the Monet database system, the thought was always there, just waiting to be awakened. Eventually I got started, and out came some sketches for a 2-dimensional hashing data structure with some quite interesting new properties.

Abstract

Spatial data storage stresses the capability of conventional DBMSs. We present a scalable distributed data structure, hQT*, which offers support for efficient spatial point and range queries using order-preserving hashing. It is designed to deal with skewed data and extends results obtained with scalable distributed hash files, LH*, and other hashing schemas. Performance analysis shows that an hQT* file is a viable schema for distributed data access, and in contrast to traditional quad-trees it avoids long traversals of hierarchical structures.

Furthermore, the novel data structure is a complete design, addressing both scalable data storage and local server storage management, as well as the management of client addressing. We investigate several different client updating schemes, enabling better access load distribution for many "slow" clients.

Keywords: Scalable Distributed Data Structure, Spatial Point Index, Ordered Files, Multicomputers

3.1 Introduction

Research is increasingly focusing on using multicomputers [Cul94] [OV91] [Tan95]. Multicomputers are built from mass-produced PCs and workstations, often having special high-bandwidth networks. Such infrastructures have emerged, and many organizations typically have thousands of machines with a large total amount of RAM, CPU and disk-storage resources.

Multicomputers provide a challenge and a promise to cope with the ever-increasing amount of information using new distributed data structures. Scalable Distributed Data Structures (SDDSs) form a class of data structures, first proposed in [LNS93], that allows for scalable distributed files that are efficient in searching and insertion. The approach gives virtually no upper limit on the number of nodes (computers) that participate in the effort. Multiple autonomous clients access data on the server nodes using their image to calculate where the data is stored. Their images might be outdated, but clients are updated when addressing errors occur.

SDDSs are especially designed to avoid hot-spots, typically a central directory. Clients are autonomous and their directory information is incomplete. When a server receives a mal-addressed request from a client, the request is forwarded to the correct server, and an Image Adjustment Message (IAM) is sent to correct the client, improving its addressing information.

So far several SDDSs have been proposed. Many of them are hash-based, such as LH* [LNS93] followed by DDH [Dev93], [WBW94] and LH*LH [Kar97]. A number of B-tree style distributed data structures were also designed, e.g., RP* [LNS94], DRT [KW94] [KW95], and k-RP* [LN96b]. k-RP* allows for multi-attribute distributed indexing.

The increased storage demands of larger amounts of spatial data gave birth to many different data structures. They can be divided into the following classes: grid-style files, (quad)tree-structured files, directory-based, and hash-based. Combinations of hash-based structures and search trees are called hybrid structures. Grid-files [NHS84] experience problems with non-uniformly distributed data. So do many multidimensional order-preserving hashing structures, for example MOLPHE [KS86], PLOP-Hashing [KS88] and [HSW88]. The basic principle of these structures is that they first map keys of several dimensions into one dimension, using for example z-ordering or similar algorithms [KS86] [Oto84], and apply an order-preserving hashing algorithm [Tam81] [Oto88] afterwards.

Balanced quad-tree-like structures (overview in [Sam89]) were developed to solve this problem. However, much of the real-world data is clustered in a few spatial areas. The dense regions' data will, when inserted, be pushed down deep in the tree structure by inserting empty nodes. This leads to excessive navigation to reach data. The BANG-file [Fre87] and the BD-tree [OS83] were designed to create more "compact" trees.

We present hQT*, a novel spatial (2-dimensional) scalable distributed data structure that can be viewed as a hybrid hierarchical structure with mostly hash-access performance. It imposes a successively denser grid on square parts (regions) of the data domain. In hQT* we access the buckets bottom-up instead of the normally used top-down method, avoiding long path traversals for data access. Furthermore, empty buckets need not be stored; they are created when data is first inserted. hQT* adapts to skewed and clustered data, typical for spatial data. Uniform distributions should achieve similar performance as DDH [Dev93], while also allowing for 2-dimensional data.

hQT* is based on lessons learned from many data structures. It combines features from the hB-tree [LS89], in that the splits are not restricted to a horizontal partitioning, and from LSD-trees [HSW89], in that subtree extractions are made. Finally, it has some similarities with the SDDS LH* [LNS93], using hashing to enable equally efficient performance for clustered data.

This is achieved through our bucket numbering schema. Each bucket, at any level in the imposed hierarchical grid, is given a unique number. These numbers are used for identifying the correct bucket. Distribution is managed similarly. The data structure is distributed by moving subtrees to new nodes. The bucket numbers identifying the subtrees are then marked as ForwardBuckets, and these make up the image of a server; clients have a subset of these entries in their image.

The rest of the paper is organized as follows. In Section 3.2 we give an overview of hQT*, followed by hQT* distribution in Section 3.3. Section 3.4 describes the splitting algorithm. Measurements are presented in Section 3.5, and Section 3.6 concludes the paper.

3.2 hQT* Overview

General principles for SDDSs as defined in [LNS93] apply to the hQT* data structure as well. The hQT* file is stored on server nodes (computers), and the applications access the data from client nodes. A server is assumed to be continuously available for the clients, but the clients are autonomous. Clients are not continuously available for access and may be off-line for longer periods.

The clients use an image for addressing data on the servers. This image can be outdated, causing the client to make addressing errors. A server receiving such a request forwards the query towards the correct server using its own image.

A file consists of records that reside in buckets. For the algorithm the relevant part is the key, which identifies a record and is used to locate the record. Buckets reside in main memory (RAM) at the servers. Each bucket has a fixed capacity of records. Similarly, servers have limited capacity, too. Each server keeps data for a number of different spatial areas (regions). For each area several local buckets are kept in main memory, virtually arranged in a (quad)tree hierarchy, using a numbering schema. The subtrees of hQT* stored at one server are there seen as root-trees. Each such tree can be uniquely identified by the number of the root node's corresponding bucket. Each server keeps a mapping from subtrees (buckets) that have been moved to other servers, yielding a server identity.

3.2.1 Records

[ n | p | x | y | attributes / blob ]   (the tail constitutes the application record)

Figure 3.1: The Record Structure.

The records in hQT* store the spatial coordinate x, y and the associated attributes a1, ..., an. The layout of the record is shown in Figure 3.1. For efficiency the calculated pseudokey p is stored as well, avoiding unnecessary recalculations during reorganization. Additionally, a pointer n is used to group records together into a linked list to create buckets. The application data is kept in the tail of the hQT* record for efficient application access. Typically the capacity of such a bucket is set to < 10 elements.
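As a purely illustrative sketch of this layout, the record could be declared as below in C; the field widths and names are our assumptions, not the actual implementation.

    #include <stdint.h>

    /* Sketch of an hQT* record following Figure 3.1: the pointer n links
       the records of one bucket into a list, p caches the pseudokey,
       x and y are the spatial coordinates, and the application data
       sits in the tail so the application can access it directly. */
    typedef struct hqt_record {
        struct hqt_record *next;     /* n: next record in the bucket */
        uint32_t           p;        /* cached pseudokey             */
        uint16_t           x, y;     /* 2-dimensional coordinate     */
        unsigned char      data[];   /* attributes a1..an / blob     */
    } hqt_record;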

3.2.2 Pseudokey Construction

We map the 2-dimensional keys into a pseudokey, a bitstring of fixed length. For the current implementation of hQT* we construct the pseudokey as follows. The 2-dimensional coordinate is mapped to one bitstring by interleaving the bits of the bit representation of the coordinate data (x, y). For example, let X and Y be bit vectors of length 4 forming the pseudokey P of length 8. If X = (x3, x2, x1, x0) and Y = (y3, y2, y1, y0), the pseudokey is then P = (x0, y0, x1, y1, x2, y2, x3, y3). As can be noted, P is the reversed interleaved bitstring. This is to simplify low-level bit programming; most algorithms like LH first consider the lowest-ordered bits, and so do we.
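The construction above translates directly into a few lines of bit manipulation. The following C sketch is our own; it assumes at most 16-bit coordinates (the [0..65535] key domain used in the experiments), so the pseudokey fits in 32 bits.

    #include <stdint.h>

    /* Reversed interleaved pseudokey: bit 2*i of P is bit i of x and
       bit 2*i+1 of P is bit i of y, so the least significant coordinate
       bits come first, as in P = (x0,y0,x1,y1,x2,y2,x3,y3). */
    uint32_t make_pseudokey(uint16_t x, uint16_t y, int bits_per_coordinate)
    {
        uint32_t p = 0;
        for (int i = 0; i < bits_per_coordinate; i++) {
            p |= (uint32_t)((x >> i) & 1u) << (2 * i);
            p |= (uint32_t)((y >> i) & 1u) << (2 * i + 1);
        }
        return p;
    }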

3.2.3 Bucket Numbering

Buckets are numbered uniquely using the pseudokey and the level. Briefly, the idea is to impose a grid on each level of a quad-tree, using regular decomposition, such that all grids align with each other when superimposed. For hQT* any "space-filling curve" can be used; z-ordering of the buckets is shown in Figure 3.2. Our pseudokey uses a reversed bitstring, yielding a slightly more visually complicated numbering, but the principle is the same. Each layer is offset by the total number of buckets in the preceding layers.

(The first three grid levels with z-ordered bucket numbers: level 0 holds bucket 0 with offset = 0, level 1 holds buckets 1-4 with offset = 1, level 2 holds buckets 5-20 with offset = 5.)

Figure 3.2: An offset space-filling curve, first 3 layers.

All buckets at all levels have a unique number. A bucket's number is calculated by first extracting the appropriate number of bits from the pseudokey, and then adding the offset stored in a table. As can be seen from the first 3 grid layers in Figure 3.2, for level 0 the offset is 0; for level 1, offset = 1; for level 2, offset = 5.

Navigation in the virtual tree is supported by functions that calculate the parent and children bucket numbers. Inexpensive table lookups are used to make the mappings efficient.
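Under one plausible reading of this schema, where level l uses the 2l lowest pseudokey bits and offset(l) = (4^l - 1)/3, the bucket number and parent calculations reduce to the table lookups sketched below; the function names and table size are ours, not the thesis API.

    #include <stdint.h>

    /* Offsets of the first bucket number at each level, as in Figure 3.2:
       0, 1, 5, 21, ...  (offset(l) = (4^l - 1) / 3). */
    static const uint32_t level_offset[] = { 0, 1, 5, 21, 85, 341, 1365, 5461 };

    /* Bucket number at a given level: the 2*level low-order bits of the
       reversed pseudokey index the grid cell, plus the level offset. */
    uint32_t bucket_number(uint32_t pseudokey, int level)
    {
        uint32_t cell = pseudokey & ((1u << (2 * level)) - 1u);
        return level_offset[level] + cell;
    }

    /* Parent bucket: drop the two refining bits added at this level. */
    uint32_t parent_number(uint32_t bucket, int level)
    {
        uint32_t cell = bucket - level_offset[level];
        cell &= (1u << (2 * (level - 1))) - 1u;
        return level_offset[level - 1] + cell;
    }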

3.2.4 Addressing

In many SDDSs (LH*, RP*) there are usually two different types of addressing algorithms, one for clients and one for servers. The client calculates the address where it believes the information resides, and the server that receives a request checks whether the data is local or has moved due to a split. These calculations are, of course, similar. In hQT* we use the same algorithm and data structure for both client and server address calculation, the difference being that the client conceptually does not store any data^1.

We start with point queries addressing local buckets. This then easily generalizes to distributed bucket (server-node) accesses.

^1 In hQT*, caching of data could be realized by letting the client store data, too.

Local Point Queries

Given a point in 2-dimensional space, one way to find its associated data is to follow a path from the top node, shown schematically in Figure 3.3a for two cases, A and B. A lies close to the root, and B is found further down the tree. However, this has been identified as an expensive operation, because clustered data will be pushed down deep in the quad-tree. In a distributed setting this is even worse, due to communication resulting in hot-spots. In hQT* we instead start from the bottom (the highest numbered existing level^2) of the tree-style structure, as shown in Figure 3.3b.

Figure 3.3: Navigation in a) Quad-Tree, b) hQT*.

Here we see that we get a direct hit for B from our calculations, but for A we have to walk up towards the root using the parent function, through non-existing buckets (dotted in the figure), until an existing bucket is found. Probing is done locally, using hash-table lookups. This allows for hash access to highly dense areas on the lowest levels, giving direct hits with minimal overhead, while less populated areas require local probes.
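Combining the helpers sketched earlier with a hash-table lookup gives the bottom-up probe in a handful of lines. The code below is a sketch only: lookup_bucket() stands in for the local hash access and max_level for the deepest grid level in use, both of which are assumptions.

    #include <stdint.h>
    #include <stddef.h>

    extern void *lookup_bucket(uint32_t bucket_no);            /* local hash table   */
    extern uint32_t bucket_number(uint32_t pseudokey, int l);  /* see earlier sketch */
    extern uint32_t parent_number(uint32_t bucket, int l);     /* see earlier sketch */

    /* Bottom-up point probe: start at the deepest level and walk towards
       the root until an existing bucket is found.  Dense regions give a
       direct hit at the first lookup; sparse regions cost a few probes. */
    void *point_probe(uint32_t pseudokey, int max_level)
    {
        uint32_t b = bucket_number(pseudokey, max_level);
        for (int level = max_level; level > 0; level--) {
            void *bucket = lookup_bucket(b);
            if (bucket != NULL)
                return bucket;
            b = parent_number(b, level);
        }
        return lookup_bucket(0);   /* the root bucket, number 0 */
    }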

Local Region Queries

Rectangular region queries in hQT* use the implicit tree structure among the buckets. Initially, the algorithm searches for a local bucket that fully contains the queried region, in the worst case giving the top node, i.e. the bucket numbered 0. The subtree below is then investigated, using the implicit tree structure of hQT*. Each subtree is first tested for whether it is contained in, or overlaps with, the region in question. All the appropriate local buckets are visited by traversing the implicit tree.

^2 Another level could be used as a starting point, but then distributed addressing would be a bit more complicated.

3.2.5 File Growth

When the hQT* file is created, only one server is used. Later, when this server's capacity is exceeded, i.e. the server is overloaded, it is split. Figure 3.4, left, shows a sequence of splits starting with the first split in Figure 3.4a. Successive splits, each moving data onto a new server, are shown in Figure 3.4b to Figure 3.4d. The first time 2 subtrees are moved, the second time only one, the third time (c) 2, and the last time (d) again 2 subtrees. The spatial areas of different subtrees do not overlap, apart from (inclusive) recursive decomposition.


Figure 3.4: Left: hQT* file key space partitioning by 4 successive splits. Right: The equivalent quad-tree.

A split is chosen in such a way that we minimize the number of squares (subtrees) to move, they have as large a coverage as possible, and the split should move half of the records (load). For now we just assume the existence of such a splitting algorithm; in Section 3.4 we describe our splitting algorithm in detail.

3.3 Distribution in hQT*

In hQT* both the clients and the servers use an image. The image stores the mapping from a bucket number to the actual bucket, or to a ForwardBucket that stores forwarding information. Using an array for the mapping would fail for skewed data. Instead we store the buckets in a hash-based structure. To test the existence of, or to retrieve, a bucket, an inexpensive hash-table lookup suffices.

3.3.1 Distribution (ForwardBuckets)

To handle distribution in hQT*, we introduce the ForwardBucket. A ForwardBucket is a replacement for a subtree that has been moved from a server; it replaces the removed subtree at its root. The spatial coverage of the ForwardBucket is identical to that of the moved tree. The ForwardBucket is associated with the address of the server to which the data was moved.

3.3.2 Distributed Point Queries

In a distributed setting, the client's image is first searched locally using the local hQT* addressing schema, resulting in a ForwardBucket instead of a leaf bucket. The operation is then forwarded to the node associated with the ForwardBucket. That (server) node is then searched, resulting in either a local bucket, if the data is stored at that server, or another ForwardBucket, in which case the request is forwarded again. Eventually, the correct server node that stores the data in a local bucket is found.

Notice that clients and servers use identical addressing operations; the only difference is that only servers store buckets with data. In practice, hQT* client addressing can be seen as a single forwarding operation.
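One possible in-memory representation of this shared addressing machinery is sketched below: the image maps bucket numbers either to local leaf buckets (servers only) or to ForwardBuckets naming another server. The types, the bottom_up_probe() helper and the server identifiers are illustrative assumptions, not the thesis API.

    #include <stdint.h>

    /* An image entry is either a local data bucket or a ForwardBucket
       pointing at the server that now owns the moved subtree. */
    typedef enum { LOCAL_BUCKET, FORWARD_BUCKET } entry_kind;

    typedef struct {
        entry_kind kind;
        union {
            void *bucket;     /* local data (servers only)      */
            int   server_id;  /* where the subtree was moved to */
        } u;
    } image_entry;

    /* Hypothetical bottom-up probe of the local image (cf. Section 3.2.4):
       returns the deepest existing entry covering the pseudokey. */
    extern image_entry *bottom_up_probe(uint32_t pseudokey);

    /* Resolve a point operation: a LOCAL_BUCKET answers it here, a
       FORWARD_BUCKET tells us which server to forward the request to.
       Clients only ever see FORWARD_BUCKET entries. */
    int resolve_or_forward(uint32_t pseudokey, void **bucket_out)
    {
        image_entry *e = bottom_up_probe(pseudokey);
        if (e->kind == LOCAL_BUCKET) {
            *bucket_out = e->u.bucket;   /* data is stored here  */
            return -1;                   /* no forwarding needed */
        }
        return e->u.server_id;           /* forward the request  */
    }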

3.3.3 Distributed Region Queries

A distributed region search is first performed locally in the client, as described earlier. Whenever a ForwardBucket is found by the client, it is noted in a list. This list serves to forward the query to the associated servers, and these servers are also given the list. These servers then repeat the same search as the requestor, processing their locally found records. Discovered ForwardBuckets are again noted. When all appropriate local records have been processed, the server sends a finished-message to the requestor. That message contains the list of ForwardBuckets/servers to which the local server will forward the query. Excluded from this list are the servers which the requestor already had knowledge about, and thus has already contacted. These servers in turn perform the same operations until there are no more servers to forward to. The requestor keeps track of all servers it awaits responses from. Whenever a finished-message is received, the corresponding ForwardBucket is removed from the list, and ForwardBuckets discovered on the remote server are added. The query is finished when this list is empty. This algorithm can be varied by requiring the forwarding servers to collect the results themselves and then send the results back. In RP* [LNS94], broadcast and multicast messages are considered. When broadcast and multicast are available, they can be used with similar performance.

Clients perform the query in exactly the same way as the servers, with the only difference that clients do not store records.
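The termination condition of the region search is essentially bookkeeping over a set of outstanding servers. A minimal sketch of the requestor's side is given below, with hypothetical set operations standing in for whatever container the implementation actually uses.

    #include <stdbool.h>

    /* Hypothetical set of server ids the requestor still awaits answers from. */
    typedef struct server_set server_set;
    extern void set_add(server_set *s, int server_id);
    extern void set_remove(server_set *s, int server_id);
    extern bool set_empty(const server_set *s);

    /* Called when a finished-message arrives from 'from_server'.  The message
       carries the ForwardBuckets (servers) newly discovered there, excluding
       the ones the requestor already knew about.  The region query terminates
       when no outstanding servers remain. */
    bool on_finished_message(server_set *outstanding, int from_server,
                             const int *discovered, int n_discovered)
    {
        set_remove(outstanding, from_server);
        for (int i = 0; i < n_discovered; i++)
            set_add(outstanding, discovered[i]);
        return set_empty(outstanding);   /* true: the query is complete */
    }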

Forwarding: Image Adjustment Messages

A client operation forwarded by a server incurs an overhead of extra messages. To minimize this overhead it is crucial that an SDDS prevents the client from repeating the same addressing mistake. Clients are therefore updated using Image Adjustment Messages (IAMs) [LNS93]. An IAM contains information that improves the image of the mal-addressing client. In LH* this is an extremely efficient procedure; at most 3 forward messages are needed [LNS93]. This is achieved through its strict linearization of buckets, which indirectly informs the client of the existence of other servers.

In a tree data structure this is less so, but tree structures instead cope better with non-uniformly distributed data. In Section 3.5.1 we will see that different client (and server) image updating schemas can improve the performance substantially, at the cost of more update messages.

3.3.4 IAM Policies

As explained earlier, a forwarded request yields an Image Adjustment Message (IAM); this makes sure that the client does not repeat the same addressing error. Naively, only the client needs to be updated. However, as noted for RP*, DDH, and others, this puts a large load on the first server, which repeatedly has to update the same client for different mistakes. The ultimate solution is to let the servers be updated by IAMs too, when their forwarded request is again forwarded by another server. Below we present the different strategies used to investigate different aspects of image updating policies. The strategies investigated are:

• ONLYCLIENT: only clients are updated.

• FORWARDERS: all clients & forwarders are updated.

• BROTHERS: as FORWARDERS, but the IAM also includes all existing sibling^3 ForwardBuckets.

• UPDATE/FORWARDERS: "update" walks up the tree at each server that forwards, registering all ForwardBuckets unknown to the client. This is intended to give a better load balance, by making servers share the relevant information with clients earlier.

• UPDATE/BROTHERS: as the previous one, but with BROTHERS instead.

Naively, every server could send updates to all servers that participate in the current operation. In practice, the final server sends the IAM to the client directly, possibly piggybacked on the reply message. It would then backtrace the path along which the client's request was forwarded, updating these servers. The benefit of updating these servers is their decreased load due to fewer mal-addressed client requests.

^3 A sibling of a bucket is another bucket with the same parent.

3.4 Server Splitting

Unlike LH* [LNS93], LH*LH [KLR96] [Dev93], and other hash-based data structures, our server splitting performs well also when splitting skewed data. LH* decides the split by a linearization of which buckets to create, which easily creates problems with skewed (non-uniform) data. Among the non-hash-based SDDSs, RP* [LNS94], for instance, uses a simple interval division at the median value, but the efficiency of splitting and how to manage the locally stored records as such is not addressed.

3.4.1 hQT* Splitting

In hQT* a server is split by a simplistic but effective algorithm that we call Dissection Splitting. In most cases we achieve a near 50% split. The best split is characterized by three properties. First, it should be as close to 50% as possible. Second, it should have as large a coverage as possible. Third, the number of identified squares should be as low as possible. Even if these properties seem to contradict each other, in most cases the expected number of selected subtrees is below 2, and seldom over 3, which means that every split mostly generates two new ForwardBuckets and moves two subtrees, respectively. These subtrees may be rooted at different levels in the tree, but they do not overlap.

3.4.2 Dissection Splitting Algorithm

A server splits when it is overloaded, by dissecting the forest of root-trees stored at that server. The first server contains only one root-tree, covering the whole spatial domain of the data structure. A split of a server is the process of selecting a subset of (sub)trees to move. The subset selected will contain roughly half of the records (load). The weight of a tree is defined as the ratio between the number of records contained in the tree and the number of records at the server, usually expressed in percent (%). Dissecting a server involves "opening up" the quad-tree by decomposing it into its subtrees.

The algorithm, shown in Figure 3.5, works as follows. Insert all roots of the server to be split into the current working set; prune all below a certain Threshold. Try all combinations; store the best combination. If there is a subset that is inside our allowed split range, [50 - maxdiff, 50 + maxdiff]%, then the algorithm terminates. If the solution is not GoodEnough, we replace the heaviest tree in the working set with its children, and start over in the loop, until the algorithm terminates.

There are mainly two parameters that control the algorithm. First, the Threshold, which tells us what sizes of trees to remove from the working set. Second, maxdiff, the maximum deviation from 50% that we still consider GoodEnough.

    proc DissectionSplitting(server) =
        L := {root-trees stored in server};
        S := nil;
        while (¬GoodEnough(S)) do
            if (S ≠ nil)
                B := remove heaviest from L;
                L := children(B) ∪ L;
            fi
            L := {r ∈ L : weight(r) > Threshold};
            for ∀P ⊆ L do
                if (BetterThan(P, S))
                    S := P;
                fi
            od
        od.

    proc GoodEnough(S) =
        abs(Weight(S) - 50%) < maxdiff.

    proc BetterThan(a, b) =
        GoodEnough(a) ∧ |a| < |b|.

Figure 3.5: Split Dissection Algorithm.

A solution is considered BetterThan another solution if the number of subtrees (S) identified to move is smaller. Experiments show that at most 4 trees are chosen, and the mean number of trees is 1.8. Further on, we observe that setting maxdiff = Threshold performs well.

3.5 Measurements

Our measurements show the performance of hQT* in different settings. Scalability is shown using 456 servers in one experiment and 3700 servers in another. We investigate the efficiency of different IAM policies, since this is a major concern for tree-structured SDDSs. An efficient policy is vital for good load balancing among the servers. Another concern is how hQT* is affected by the data input order. For example, loading data in the "wrong" order can cause many structures to degenerate [KW95]. We display results showing that hQT* automatically redistributes the load in the case of ordered inserts.

3.5.1 Efficiency of IAM Policies

In line with other SDDS evaluations [LNS93] [Dev93], we present performance figures measuring the overhead of forwarding. We show the performance of different IAM policies, in the end choosing the best and most load-balancing policy for hQT*.

The experiments randomly generate 262,144 data points (coordinates) in the spatial domain. They are evenly distributed over the key domain ([0, 65535]). The mean number of messages over a series of clients is measured. One client is used at a time in these tests. The activeness of a client determines the overhead, which is tested by restarting the client with a probability P after every insert; we show the value 1/P, roughly indicating the lifetime of a client.

During the experiments the file grows from one server to several by splitting when the servers are overloaded, reaching 3700 servers in the first experiment and 456 in the second.

The acceptance interval for a satisfactory server split is set to a weight of [45, 55]%, and the lightest subtree considered is accordingly set to 5%.

Update messages for clients and servers are not counted directly; instead the forwarding overhead is shown. There are several reasons for this. There are several different efficient ways to implement each strategy, delayed update of servers from servers, etc. For example, in the ONLYCLIENT option there is either a single message per forwarding, in which case the displayed values can be recalculated as new ≈ (old - 1) * 2 + 1, or the final server sends back an update message to the client, yielding considerably fewer messages. Other strategies trace the updates back to the client from the final server through all the servers that forwarded the request, in the end incurring double the overhead (new).

3700 Servers

First we investigate a scenario where each server can hold at most 100 elements. This gives approximately 3700 servers^4. The values presented in Table 3.1 are the average number of messages used for inserts; the count includes the forwarding messages.

When a client inserts 100 elements in a file distributed over 3700 servers, it is likely to access around 100 servers. In the table the best strategy, UPDATE/BROTHERS, gives only 1.81 - 1 = 0.81 extra messages on average, indicating the efficiency of SDDS addressing and updating schemas.

At a first glance, one would expect the ONLYCLIENT strategy to give the worst overhead, and, furthermore, BROTHERS does not seem to give much improvement over FORWARDERS. However, this is an illusion. In Figure 3.6 we show the number of forwardings that each of the servers numbered 1, 2, 3, and 31 had to perform. In this run the probability (P) of a new client is 1/100, and the same amount of data is entered.

For ONLYCLIENT, server 1 needs to perform nearly 23,000 forwards. But worse are FORWARDERS and BROTHERS, which need nearly 200,000 and 146,000 forwards for the same data. Now, three candidates remain.

^4 Obviously, 3700 servers is a very high and unlikely number, and full data availability will not be feasible, but it shows the good characteristics of the data structure taken to an extreme.

    1/P       ONLYCLIENT  FORWARDERS  BROTHERS  UPDATE/FORWARDERS  UPDATE/BROTHERS
    100          3.98        1.98       1.90          1.94              1.81
    1 000        2.53        1.73       1.53          1.63              1.44
    10 000       1.46        1.27       1.16          1.23              1.15
    100 000      1.12        1.08       1.05          1.08              1.05

Table 3.1: Average messages inserting 262,144 data points using different IAM policies, using 3600 servers.


Table 3.2: Average messages for inserts using 456 servers.

The three remaining candidates, depicted in Figure 3.6, are ONLYCLIENT, UPDATE/FORWARDERS and UPDATE/BROTHERS. The first two candidates both incur quite a few forwards for the first server, but UPDATE/FORWARDERS wins for the remaining servers, in total reducing the number of forwards. However, UPDATE/BROTHERS is the clear overall winner, since the work of forwarding is further reduced to less than half for the first server, sharing the load with the rest of the servers. This is achieved through its faster client and server image updating schema.

456 Servers

In the second scenario, each server can hold at most 1000 elements. This results in 456 servers for similarly generated data, as shown in Table 3.2. Apparently, fewer servers incur less messaging, compared with the 3700 servers of the previous scenario. Again, UPDATE/BROTHERS is clearly the winner.

(Forward message count per server, [At Server #], on a logarithmic and on a linear scale, for the OnlyClient, Update/Forwarders and Update/Brothers policies.)

Figure 3.6: a) Forward message count using different policies on servers 1, 2, 3 and 31. b) Only the 3 most efficient strategies.

3.5.2 Server Load Distribution

One major problem for many data structures is that they assume that data is inserted unordered (randomly). Ordered input leaves these structures unbalanced, with substantially reduced performance. To assess the performance of hQT* under a very skewed distribution we use regional point data from the SEQUOIA benchmark [SFGM93] (file ca). The file contains 62,584 California place names. The data is skewed, with several dense regions (cities) with many points, leaving other areas almost empty.

In the first experiment, we insert the data in the order it occurs in the file, sorted on the names of the places. In the second experiment we insert the data sorted first on the X and then on the Y coordinate values, resulting in a totally different input order.

By first studying how server 0 splits for the different distributions, we can assess whether the load balancing is efficient, by observing the number of regions server 0 forwards queries to. Second, a common goal for data structures is to linearize the splitting so that the splits are evenly distributed over the time that data is inserted, effectively spreading the cost of redistribution evenly over time.

91 servers were created for the unordered inserts, and 101 servers for the ordered inserts. Servers were split when their local capacity of 1000 records was reached.

For the unordered set, server 0 was split 6 times, giving 11 regions (ForwardBuckets), and for the ordered set 23 times, giving 39 regions^5. However, in the ordered case most of these areas were moved as the structure evolved, in the end leaving it the responsibility of only 13 regions, effectively restructuring and balancing the distribution tree (forwarding structure). This is explained by viewing the number of ForwardBuckets that are contained in the subtrees which are moved. During a split, when a subtree is moved, contained ForwardBuckets from previous splits are also moved. In effect, the responsibility for these regions is given away to a new server with larger spatial coverage.

Viewed at a distance, one notices that hQT* splits top-down for unordered inserts, in a hierarchical manner, but for ordered inserts the splits partly occur bottom-up, due to their clustered occurrence in the input stream. However, since splits near the root occur later, they include already distributed domains, yielding a balanced tree in the end anyway. In the plots in Figure 3.7, we study the number of splits with respect to the number of inserts. The two curves show the ordered and unordered inserts, respectively. The server splits are shown on the X-axis, and the global number of inserts on the Y-axis. Even if the "Random Input Order" curve grows slightly faster, being more eager to split, both of them show quite a nice spread of the server splits over time.

^5 The number of non-overlapping regions a server holds is an indication of how much forwarding it has to perform.

(Global number of inserts versus server split number, [Server Split#], for Random Input Order and Sorted Input Order.)

Figure 3.7: Split distribution over "time".

3.5.3 Discussion

We have investigated several strategies for IAM updating of the clients (and servers), measuring the forwarding cost. One strategy, UPDATE/BROTHERS, shows excellent performance compared to the others. The extensive measurements show the messaging overhead to be low: using a client to insert 10,000 records on 456 servers incurs a message overhead of 1%, and 6% for 1,000 records. Furthermore, it allows for faster client image adjustments, distributing the access load evenly over the servers.

We show that the file can be scaled up to 3700 servers with a reasonable added messaging cost. For example, a client inserting 10,000 records incurs an overhead of 15% while being updated with knowledge of 3700 additional servers. Compare this to a naive solution where only the client is updated, which gives an overhead of 46%. For more active clients (100,000 records) the overhead is down to 5%.

hQT* automatically adjusts the tree structure when ordered data (clustered in spatial areas) is inserted. This is achieved by our Dissection Splitting algorithm, which first considers distributing larger spatial areas. So, for a skewed server splitting order, the responsibilities for spatial areas are reassigned to different nodes.

Studying how the server splits occur over time, one can observe a near-linear function, even for ordered data inserts, achieving results comparable to linearizing hashing structures.

3.6 Conclusions

We have shown that hQT* is a well-behaving Scalable Distributed Data Structure. It allows for spatial point data insertion and point and region retrieval. hQT* is a 2-dimensional order-preserving hashing structure. It is a complete solution, in that it also stores and manages the local storage. Inserts can be performed in close to 1 message, and point data access in close to 2 messages. Using tenfold more servers, the same client still achieves similar performance (the added cost is 5% for a client inserting 100,000 records).

We have investigated different IAM updating strategies, choosing a strategy that updates the images on both the servers and the clients. The chosen strategy allows for a very low messaging overhead for active clients, 6% overhead for a reasonably active client, and a moderate overhead for less active clients.

We have shown that an hQT* file can scale to thousands of servers while still giving acceptable performance. Increasing the number of servers from a few hundred to ten times as many does not increase the overhead by a factor of ten, but substantially less, depending on how active the clients are.

In our experiments, the server splits occur evenly distributed in time, even for ordered inserts of data. Furthermore, it is shown that the tree structure automatically restructures on spatially clustered inserts over time.

In the future we plan to extend hQT* to allow for n-dimensional data, allowing for data mining storage and queries. Extended objects are also a challenge for SDDSs. Further on, database query processing over SDDSs: a currently progressing project is underway to incorporate the hQT* data structure.

Chapter 4

Ω-storage: A Self Organizing Multi-Attribute Storage Technique for Very Large Main Memories

This chapter is based on the publication Ω-storage: A Self Organizing Multi-Attribute Storage Technique for Very Large Main Memories, presented at the Australian Database Conference, Canberra, Australia, January 2000.

Here we present the concept of the Ω-storage, and identify the design space for how to build the Ω-trees. It is a combination of several choices. First, the attribute bits are preferred in an order beneficial for range queries. Second, bits which would give a badly balanced tree are avoided. Third, the algorithm attempts to enable each attribute to have the widest influence on the pruning, by a clever queuing of the attributes to use.

Abstract

Main memory is continuously improving both in price and capacity. With this come new storage problems as well as new directions of usage. Just before the millennium, several main-memory database systems are becoming commercially available. The hot areas include boosting the performance of web-enabled systems, such as search engines and auctioning systems.

We present a novel data storage structure, the Ω-storage structure, a high-performance data structure allowing automatically indexed storage of very large amounts of multi-attribute data. The experiments show excellent performance for point retrieval, and highly efficient pruning for pattern searches. It provides the balanced storage previously achieved by random kd-trees, but avoids their increased pattern-match search times, by an effective assignment of bits of attributes. Moreover, it avoids the sensitivity of the kd-tree to insert orders.

Keywords: Main-memory multi-attribute data storage, self organizing, pattern matching, based-on-bits

4.1 Introduction

Many unconventional database applications require support for multi-attribute indices to achieve acceptable performance. Decision Support Systems allow users to analyze and use large amounts of data online. Queries may use several attributes simultaneously. Using a main-memory multi-attribute index can greatly speed up interactive analysis for the online user.

The holy grail in database systems is a data structure that supports multi-attribute indexed storage, has minimal insert overhead, and yields highly accelerated searches over very large amounts of online data.

We can observe from the abundant literature that most multidimensional data structures fail one way or another, either for a high number of attributes [WSB98], or when the data is not evenly distributed [NHS84]. Most schemes are static in their partitioning, assuming total randomization, which leads to multi-dimensional hashing of different kinds. Other schemes use adaptive and dynamic partitioning, often at the cost of a large main-memory overhead instead.

The Ω-storage technique proposed here is a novel design. It is an automatic and adaptive indexed storage technique. It requires no tuning or programmer/application selection of indices. Indexing is only performed for data when beneficial in terms of balanced storage under inserts, keeping the indexing overhead low. The Ω-structure is optimized for high-performance record retrieval and searches, while allowing incompletely specified searches, i.e., searches where only a subset of the attributes' values are known, also known as pattern searches. The Ω-tree is a dynamic tree data structure that copes well with varying data distributions. Point inserts and retrievals are completed in logarithmic time.

In contrast to most multi-dimensional hashing schemas, the Ω-tree exploits the data skew. It ignores bits which have no use for indexing, providing highly efficient and adaptable incremental reorganizations. Moreover, data inserted in sorted order on one or several attributes hardly affects the shape of the resulting tree. Experiments in Section 4.4 ascertain this by comparing with a kd-tree, which experiences high skew.

4.2 Related Work

In this section we give a short overview of different partitioning methods. This area of research started out optimizing the usage of narrow resources, such as main memory, by reducing disk accesses and limiting CPU usage. During the 90's the scenario has evolved to the needs of new application areas focusing on high availability and high-performance accesses. This is achieved by index structures that use (sometimes distributed) main memory to automatically manage highly dynamic datasets and that can adapt themselves to different distributions, avoiding the deficits of earlier indexing methods' worst-case behaviors.

For static data sets, one can employ a choice vector which defines what bits from what attributes to use. Furthermore, the bits can be chosen in such a way that recurring queries run fast. This is shown in the multi-attribute hashing structure proposed in Towards Optimal Storage Design Multi-attribute Hashing [Har94]. Two strategies are investigated for the selection of the bits: one method gives each attribute an equal chance of being used, the other gives the minimal bit allocation, also referred to as the optimal allocation.

Many multi-dimensional storage structures are based on the idea of mapping several dimensions into one dimension and then exploiting the highly investigated field of one-dimensional data structures. An effective scheme is to use multi-dimensional (order-preserving) hash structures. A pseudokey (a bitstring of fixed length) is constructed by interleaving bits from the different attributes. During the insertion of data into the storage structure, an increasing number of bits are used to organize access to the data. Different strategies include MOLPHE [KS86], PLOP-hashing [KS88], quad-trees [Sam89], kd-tries [Ore82], and others [Oto88] [HSW88] [Tam81]. However, it is common in these statically defined hashing schemes that while some bits happen to be "random", others are totally useless for indexing and lead to unbalanced structures.

The prominent tree-based structure for multi-attribute searching is the kd-tree [Ben75]. It is a binary tree. The discriminator in internal nodes was originally limited by strict cycling through the attributes, attribute i%k at level i in the tree. Later, the optimized kd-tree [FBF77] was introduced, storing the records in buckets and choosing the attribute with the largest spread in values as the discriminator, using the mean value for partitioning. kd-trees were then introduced as a general search accelerator for searching multi-key records, by suggesting means for storing data on secondary storage devices [Ben79].

Rivest's PhD thesis [Riv74] analyzes, among other structures, a kd-tree style structure using a binary bit from the data as the discriminator. The performance of tries [Riv74] is also analyzed, which parallels the analysis of the kd-trees. Here the discriminator of a node is chosen so that it has not been used higher up in the tree.

This overview demonstrates that both static and dynamic methods can supply only a partial solution to the problem space. The Ω-storage combines these methods in a way explained in the next section.

4.3 The Ω-storage

We now explore the design space of the Ω-structure, using a dynamic tree structure to efficiently prune the search space. The tree uses actual bit values from the attributes to organize the tree during the split of a leaf node into several new nodes. A number of split strategies are discussed in Section 4.3.4. We show, in Section 4.3.1, how the records are stored in buckets clustered on attributes, and, in Section 4.3.3, how the structure is searched efficiently.

For simplicity it is assumed that all attributes are discrete and of limited cardinality. Furthermore, for simplicity we assume the domain of an attribute to be compacted to the range [0..2^N - 1], where N bits are needed to store the data.
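As a small illustration of this normalization (a sketch of ours, not code from the thesis implementation), the number of bits N for a domain of a given cardinality, and the compaction of a raw value into [0..2^N - 1], can be computed as follows; the resulting values agree with the Bits column of Figure 4.1 (e.g., Age with 121 possible values needs 7 bits).

    #include <stdint.h>

    /* Smallest N such that 2^N >= cardinality, i.e. the number of bits
     * needed to store a domain compacted to the range [0 .. 2^N - 1].   */
    static unsigned bits_needed(uint32_t cardinality)
    {
        unsigned n = 0;
        while ((1u << n) < cardinality)
            n++;
        return n;
    }

    /* Compact a raw attribute value from [lo .. hi] into the normalized
     * range [0 .. hi - lo].                                              */
    static uint32_t normalize(int32_t value, int32_t lo)
    {
        return (uint32_t)(value - lo);
    }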

4.3.1 Buckets and Branch nodes

[Figure 4.1 shows a minimal Ω-tree consisting of a single bucket (L=0, Count=10) with attribute queue (X, Y, City, Age, Sex), its vertically partitioned sample records, and the following attribute table:]

    Attribute   Domain            Bits
    Age         [0..120]          7
    City        [0..10]           4
    X           [0..7]            3
    Y           [0..49]           6
    Sex         {male, female}    1

Figure 4.1: A bucket of an Ω-tree and its attributes.

The Ω-tree consists of two components: the leaf nodes, which store the data, and the branch nodes, which organize the access structure. A branch is defined by the split-points of a node. A split-point is a tuple (attribute, bit). In general there can be 2^#split-points branches. The leaf nodes (buckets) contain vertically partitioned records, i.e., using one array per attribute. Vertical partitioning has been shown to give superior performance in Monet [BWK98] [BK95]. A minimal Ω-tree consisting of a single bucket is shown in Figure 4.1, which also shows the stored attributes. The domain column shows the discrete value span of the domain, and the column "Bits" shows the number of bits required to store the normalized domain. The branch nodes lead to branches at lower nodes or to leaf nodes. The characteristics of the branches are decided at split time.
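To make this layout concrete, the following C sketch (our own illustration, not the actual Ω-storage source; the names, array sizes and the LIMIT constant are assumptions) shows one possible representation of branch nodes with their split-points and of vertically partitioned buckets:

    #include <stdint.h>

    #define MAX_ATTRS 16        /* assumed upper bound on attributes          */
    #define LIMIT     1000      /* bucket capacity; tuned in Section 4.4      */

    typedef struct { int attr; int bit; } SplitPoint;  /* (attribute, bit)    */

    typedef struct {            /* leaf node: a vertically partitioned bucket */
        int       count;                     /* number of records stored      */
        int       queue[MAX_ATTRS];          /* attribute queue used at splits*/
        uint32_t *column[MAX_ATTRS];         /* one value array per attribute */
    } Bucket;

    typedef struct Node {
        int is_leaf;
        union {
            struct {                         /* branch node                   */
                int          nsplits;        /* k split-points ...            */
                SplitPoint   split[4];
                struct Node *child[16];      /* ... give up to 2^k branches   */
            } branch;
            Bucket leaf;
        } u;
    } Node;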

4.3.2 An Example

In Figure 4.2 we show a more elaborate Ω-tree. The branch nodes have a set of split-points. A split-point poses limitations on the subtrees that are reached by following the branches. Each bucket's domain is completely and uniquely specified by its path from the root node. For a record to be stored in a bucket, it has to fulfill the conditions summarized in the box of the bucket.


Figure 4.2: A "typical" Ω-marshaled tree.

Thee "root-node" the leftmost node splits the tree into two branches. Theree is only one split-tuple in that node, namely (City, 3), which indi• catess that the tree has two branches split using the 3rd bit from the attribute City.. Bits are numbered 0 from the Least Significant Bit (LSB) to the Most Significantt Bit (MSB). Since the split uses the highest bit of city (bit 3), itt divides its domain into two intervals. One being city<8, and one being cityy > 8. At the next level, splits have been decided independently in the twoo sub-trees. The city<8 branch splits on age, (Age, 6), giving the two branchess age<64 and age>64. The lower sub-tree, city>8, was split using twotwo attributes bits, again (Age , 6) but with (Sex, 0), giving 4 branches. The identityy (additional restrictions) of the branches can be seen in the figure. In thiss tree, both nodes of the second level are split again on the age attribute usingg the 5th bit (value = 25 = 32), further dividing the intervals. Inn some cases, a bit will not be used, as can be seen in in the uppermost rightt node in Figure 4.2. There there is a node split using (City, 1), bit (City,, 2) has been "skipped" over. The reason is that the bit was of no usee for splitting for all records the bit have the same value. However, when searchingg this is not known, therefore there is an uncertainty about which domainn the sub-tree a specific value belongs to. This creates a complex 944 CHAPTER 4. iï-STORAGE: MULTI-ATTRIBUTE STORAGE activee interval for the resulting buckets. We have depicted the domain of thee sub-tree for the city attribute as city="OXOX" and city="OXlX". The "X:s"" can still be used in a further split in this sub-tree. When searching ann explicit value using only city, still only one branch needs to be visited. Ann interval search on city both branches may have to be visited. However, studyingg the domain set of the two buckets we find that the first stores city GG {0,1,4,5} and the second city G {2,3,6,7}.

4.3.3 Point Searching

To locate the position in the tree for a given record, we start navigating the tree at the root node. By examining the branches at the current node we decide which branch to follow, reaching a new node. The process is repeated, eventually ending up in the appropriate bucket. A branch is chosen if the node's split-point values on the branch agree with the same bits in the record.

Similarly to the kd-tree, the search/insert complexity is logarithmic [Ben75]. Incompletely specified searches are performed similarly, but may enter several branches and buckets.
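On the node layout sketched in Section 4.3.1 (our assumption, not the thesis code), point navigation can be expressed as follows: at every branch node the split-point bits of the record are concatenated into a branch index.

    /* Bit `bit` (0 = LSB) of attribute `attr` of a record given as one
     * normalized value per attribute.                                      */
    static int bit_of(const uint32_t *record, int attr, int bit)
    {
        return (int)((record[attr] >> bit) & 1u);
    }

    /* Descend from the root to the bucket responsible for `record`.        */
    static Bucket *locate(Node *node, const uint32_t *record)
    {
        while (!node->is_leaf) {
            int branch = 0;
            for (int i = 0; i < node->u.branch.nsplits; i++) {
                SplitPoint sp = node->u.branch.split[i];
                branch = (branch << 1) | bit_of(record, sp.attr, sp.bit);
            }
            node = node->u.branch.child[branch];
        }
        return &node->u.leaf;
    }

An incompletely specified search differs only in that, whenever a split-point refers to an unknown attribute, both values of that bit are tried and the corresponding branches are visited.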

4.3.4 Splitting Strategy

During the growth of the tree, buckets will become overloaded, i.e., reach their storage capacity, causing them to split. A split is performed by partitioning the contents of the current bucket into new buckets and replacing the current bucket by a branch node. The partitioning is defined by a split-point. Which split-point is chosen depends on the split strategy employed. More explicitly, a split strategy defines: the attributes to consider and their order, the order in which the bits of the attributes are preferred, and which bit-value distributions are acceptable. A bit is acceptable when the count of 1s over the records in the bucket is in the percentage range [50% - Δ, 50% + Δ], where Δ is a structure parameter, further investigated in Section 4.4.5. More significant bits are preferred.

We use a new split strategy called Ω-marshal, which fulfills a number of goals. First, all attributes should be given a chance of being used in the split-points of the tree structure. Secondly, it aims to use attributes in split-points over the whole width of the tree, to guarantee efficient pruning. Third, bits are used for easy splitting and organization of the tree. Fourth, bits are preferred in such an order that range queries benefit. And, finally, bits are chosen by a local split operation only if they are acceptable.

Alternative strategies are used in the randomized kd-tree [DECM98], the kd-trie [Ore82], and Ω-pseudo [KK99]. These strategies have been found to have their limitations. This is further discussed in [KK99], where a metric is developed to shed light on their inner workings. Based on these experiences we have designed the Ω-marshal strategy.
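The acceptability test follows directly from this definition; a minimal sketch (the names are ours):

    /* A bit is acceptable if the fraction of 1s among the bucket's records
     * lies in [0.5 - delta, 0.5 + delta], e.g. delta = 0.10 or 0.20.       */
    static int acceptable(int ones, int total, double delta)
    {
        double frac = (double)ones / (double)total;
        return frac >= 0.5 - delta && frac <= 0.5 + delta;
    }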

Splitting Algorithm

The pseudo-code in Figure 4.3 describes the details of how a bucket is split in the Ω-marshal structure. If a bucket, after an insert, has reached its LIMIT, it is split and replaced by a new internal node. The new node contains branches to newly created buckets. For efficient splitting, using our vertically partitioned storage schema, we first determine the split-point. The split-point is determined by searching the attributes from the queue in the bucket. The first attribute with an acceptable bit is chosen.

When a split-point has been found, it is used to create a split vector that holds the destination bucket for every record. Both the search for a split-point and the creation of the split vector require sequential accesses only. Then, in Split, the attributes are moved sequentially to the new buckets. The new buckets are then assigned a queue where the used attribute has been moved to the end, enabling a cycling through the attributes in the queue. A C sketch of this procedure is given after Figure 4.3.

proc DetermineSplitpoint(array Records, array Attributes) =
    for ∀a ∈ Attributes do
        for ∀r ∈ Records do
            update count(r.a)
        od
        for ∀bit ∈ 31..0 do
            if (acceptable count(bit))
                return <a, bit>
            fi
        od
    od.

proc CalculateSplitvector(array Records, <a, bit>) =
    array move[1..LIMIT]
    for ∀i ∈ 1..LIMIT do
        move[i] := (Records[i].a.bit)
    od
    return move.

proc Split(array Records, array Attributes, array move) =
    array Bucket[0..1]
    for ∀a ∈ Attributes do
        for ∀i ∈ 1..LIMIT do
            add Records[i].a to Bucket[move[i]]
        od
    od.

Figure 4.3: Bucket Split Algorithm
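A C rendering of Figure 4.3, under the bucket layout and acceptable() helper sketched earlier (our illustration; the thesis implementation may differ in detail), could look as follows. Both passes touch the columns strictly sequentially, which is the point of the vertically partitioned layout.

    /* Find the split-point: the first attribute in the bucket's queue that
     * has an acceptable bit, scanning bits from most to least significant.  */
    static int determine_splitpoint(const Bucket *b, int nattrs, double delta,
                                    SplitPoint *out)
    {
        for (int q = 0; q < nattrs; q++) {
            int attr = b->queue[q];
            for (int bit = 31; bit >= 0; bit--) {
                int ones = 0;
                for (int r = 0; r < b->count; r++)
                    ones += (int)((b->column[attr][r] >> bit) & 1u);
                if (acceptable(ones, b->count, delta)) {
                    out->attr = attr;
                    out->bit  = bit;
                    return 1;
                }
            }
        }
        return 0;                      /* no acceptable bit in any attribute  */
    }

    /* Split: one pass builds the split vector (destination bucket per
     * record), then every column is moved sequentially to the new buckets.  */
    static void split_bucket(const Bucket *b, int nattrs, SplitPoint sp,
                             Bucket *dst[2])
    {
        int move[LIMIT];
        for (int r = 0; r < b->count; r++)
            move[r] = (int)((b->column[sp.attr][r] >> sp.bit) & 1u);

        for (int a = 0; a < nattrs; a++) {
            int pos[2] = { 0, 0 };
            for (int r = 0; r < b->count; r++) {
                int d = move[r];
                dst[d]->column[a][pos[d]++] = b->column[a][r];
            }
            dst[0]->count = pos[0];    /* same counts after every column pass */
            dst[1]->count = pos[1];
        }
    }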

4.4 Performance evaluation and Tuning

In this section we analyze and benchmark the Ω-storage structure. We confirm that single-record inserts and searches conform to O(log(n)), and that the time for partially specified record searches decreases exponentially with the number of attributes specified. A comparison is performed against the kd-tree, showing excellent search performance and highly improved stability in single-record search times, enabled by the better balanced tree. The number of internal nodes used by the Ω-structure is 14K, compared to the 24K of the kd-tree.

For our experiments we use an SGI Origin 2000, currently equipped with 24 CPUs and a total of 48 GBytes RAM. A 64-bit process can transparently, from the programming point of view, access all its RAM. However, there are extra costs. Each CPU has "local" access to 2 GBytes RAM, and "remote" memory is cached. The operating system may move processes from one CPU (and memory) to another if it decides that this would be beneficial for the process, because a large number of remote memory accesses can be eliminated by executing the process at another CPU where that memory is local.

First, we find the optimal capacity of an Ω-bucket. Then we discuss the insert performance.

4.4.1 Bucket Size vs Pruning

Although the Ω-tree is an automatic storage schema, there are still a few parameters of interest to tune. These parameters depend on the underlying hardware, i.e., memory access costs, memory access patterns and cache performance.

We identify two such parameters for the Ω-tree. The first parameter is the average bucket size, implied by an upper LIMIT on the bucket size. This parameter is affected by the target hardware's cache capabilities. The second parameter is the acceptable 0/1 frequency of a bit to be considered in a split. If the bucket size is too large, a single-element search exhibits time linear in the size of the bucket, and if the bucket size is too small, we will spend more time navigating the tree structure. Therefore, we choose to determine a LIMIT large enough not to influence single-element search times.

    Bucket Size   Max   Min   Avg    Std Dev
    50            17    15    16.5   0.82
    100           18    16    16.7   0.82
    200           21    15    16.9   1.51
    1000          25    18    20.1   2.23
    2000          38    19    26.5   5.40
    3000          44    25    34.2   6.72
    10,000        130   44    77.5   32.63

Table 4.1: Statistics on point searches, varying bucket limits, times in [µs].

Table 4.1 shows experiments with single-point searches where the upper bucket size limit is varied between 50 and 10,000 elements per bucket. 10M records with 8 attributes were inserted, using 320 MBytes of memory just for the record storage. The search time is dominated by the navigation time.


Figure 4.4: Varying the bucket size: a) search times b) insert times.

For a limit of up to 1000 records the search time is stable, starting at 15 µs for a bucket size of 50 and going up to approximately 20 µs at a limit of 1000. Beyond 1000 records the actual size of the bucket is reflected in, and (over 4000) dominates, the search times.

Figure 4.4(a) depicts the search time for different searches using 1, 2, and 3 attributes. Using fewer attributes in a query increases the search time, with local minima at limited bucket sizes of 200 to 1000 and around 5000 to 10000. For the remaining experiments we choose the maximum bucket size to be 1000, since point searches are then predictable and fast, and other searches are reasonable in search time.

In Figure 4.4(a) there is a peak at a LIMIT of 2000. The explanation is simple: at this point the number of buckets decreases, and with that the number of branch nodes, causing the height of the tree to become somewhat lower. For these queries this had the effect that the search times increased, since a vital index level vanished.

4.4.2 Insert costs

Figure 4.4(b) shows the total insert times in seconds for 10M inserts into the Ω-tree using increasing bucket sizes. The overhead consists of function call costs and the cost of tree reorganizations. As a reference we show the time for inserting the records into one array per attribute. The quotient between the insert time of the Ω-tree and the insert into arrays roughly approximates the height of the tree, thus reflecting the number of times a value has been copied during a split.

For example, inserts with LIMIT = 1000 use 145 seconds, 12.1 times as long as linear storage. This insert time is related to the current tree depth, in this case approximately 14. Increasing the bucket size by a factor of 10 (LIMIT = 10000) just causes a slight decrease to 10.6 times as long. Still, at LIMIT = 10,000 the actual time per insert is only 14.5 µs. Inserts using the Ω-tree show a final overhead of 22 seconds for an "infinitely" large bucket.


Figure 4.5: Search time using a) 1, 2, 3, 8 attributes in the Ω-tree b) detail of one attribute.

4.4.3 Search cost for a growing data set

First we ascertain that a point search is a highly efficient and fast operation. As can be seen in Figure 4.5(b), the cost starts at 19 µs for 1 million tuples and increases only to 27 µs for 140 times more data!

When querying larger data sets, as shown in Figure 4.5(a), query times are significantly higher. For these specific queries the query time as well as the result set size increase linearly with the file size. The 3-attribute query rises to just above 400 ms; 2 attributes give a search time of 2000 ms, nearly the performance of linear scanning, whereas 1 attribute is somewhat more efficient. The reason is that the result of the 2-attribute query is a subset of the results of the 1-attribute query and that there is no index available for the second attribute; thus the same amount of data is scanned, but two attributes have to be tested.

In Figure 4.6(a) we compare the Ω-tree pattern search performance with linear scanning. Point search, in this case an 8-attribute search, gives the highest improvement, with a search time negligible compared to scanning. The other queries improve the search time by a factor of 3 to 10 over scanning. The improvement ultimately depends on the search pattern specified and the result size.

To assess the average performance of different queries, we observe all partial match queries (256 for 8 attributes), using a subset of the attributes from a specific record. The resulting plot is shown in Figure 4.6(c). The average performance quickly improves when more attributes are present, decreasing from seconds to microseconds for point queries. The best performance, 14 µs, is achieved when all attributes (8) are specified in the query. Using 7 attributes gives results in the range from 20 µs up to 299 µs, with an average search time of 50 µs.


Figure 4.6: a) Search times b) 8, 16 attribute files c) 8 attribute file d) 16 attribute file.

4.4.4 Influence of Number of Attributes

To investigate how the Ω-tree performs for a higher number of attributes, we build files with a varying number of attributes. We have already shown the performance for 8 attributes, and will now briefly compare it with 16 attributes.

Figure 4.6(d) shows statistical values for searching a 16-attribute file. When the information in the pattern search is increased, the average search times decrease exponentially. For example, when 10 out of 16 attributes are known, the average search time is approximately 10 ms, and for 12 out of 16, it is about 1 ms. The individual searches do, however, exhibit a significant variance in their search times, on the order of magnitudes. The curves for the 8-attribute file, Figure 4.6(c), have similar characteristics.

Observing the minimum and maximum search times, it is clear that they are orders of magnitude apart. Comparing the case of the 8-attribute file to that of the 16-attribute file, one notices that the spread (variance) of the searches decreases. In the case of the 16-attribute file each record has double the amount of data (bits) stored in it, giving more choices for the split algorithm when choosing a bit for the split. This gives a higher chance of a better load-balanced structure, by giving more options to fulfill the goals during the split.

4.4.5 Comparison with kd-tree

We now compare the performance of the Ω-tree with the performance of the kd-tree, using an 8-attribute file, searching 10M records.

Figure 4.7(a) shows the performance of the kd-tree and the Ω-tree with Δ = 10% relative to the better performing Ω-tree with Δ = 20%. For the Ω-tree 10% the overhead is reasonably stable around 10-20%, but for the kd-tree it starts at around 30% and goes up to 10 times as much when searching records using 8 attributes. The reason for this is the skewed kd-tree, which for most of the data in this experiment became very deep, thereby causing a large overhead when accessing individual records at the leaves.

Figure 4.7(b) shows the standard deviation observed for the kd-tree and the Ω-tree. For less specified pattern searches the standard deviation is slightly higher for the kd-tree than for the Ω-tree. For highly specified patterns the deviation of the kd-tree does not improve compared to the Ω-tree. The Ω-tree comes near a perfectly stable search time, with a very low variance of 2.

This is due to the inability of the kd-tree to handle different data distributions and insert orders: the kd-tree may become very skewed, giving very fast access to some records that are near the root but performing poorly when searching for records that are stored further down in the tree. The Ω-tree, on the contrary, exploits these skewed distributions and insert orders, providing more stable search performance.


Figure 4.7: 8-attribute file, standard deviation for pattern searches: a) the kd-tree and Ω-tree 10% compared with the Ω-tree 20% b) the kd-tree compared with the Ω-tree with acceptance limits of 10% and 20%.

We make two observations. First, increasing the acceptance interval, as defined in Section 4.3.4, from Δ = 10% to Δ = 20% gives a slight but stable performance increase. The only noticeable drawback is that it increases the number of buckets needed to store the data from around 14K to 15K, with the depth still around 14 levels. This is to be compared with the 24K buckets generated by the kd-tree and its depth of 25 levels. The second observation is that the performance of the Ω-tree is much more stable than that of the kd-tree. The Ω-tree avoids creating skewed trees in cases where the kd-tree would.

4.5 Exploration of the Ω-tree design space

This section seeks to give some background information on the design choices that were made during the development of the Ω-storage.

4.5.1 Branches

(Binary) tree structures have the disadvantage, compared to hashing structures, that they require substantial navigation effort, visiting a large number of dispersed memory locations. The number of navigation steps can be reduced by having more branches in each branching node, for example 2^i elements instead of 2 as in the binary tree. However, additional costs are associated with calculating the appropriate branch(es), i.e., a cost similar to the cost of hashing. In our case, we tried solutions which simply "compact" the binary tree by combining several split-points into one node, giving 2^#split-points branches. The additional cost of combining the split-points into one value to be used for navigation turned out to be higher than, or the same as, the cost of the navigation avoided. For such an approach to be fruitful a large number of branches is needed.

Linear scanning is faster than complicated tree navigation or hashing up to a certain number of elements. This is due to the added complexity which hashing and tree navigation introduce. Adding more branches also reduces the flexibility of the organization of the tree, as well as increases the storage overhead.

Reducing dimensionality

The high-dimensionality curse causes problems with the building of the tree. For example, for a balanced k-dimensional kd-tree, at least 2^k leaf nodes are required for each dimension to affect the organization of the tree, i.e., to aid the navigation.

By avoiding splits directly along the axes (dimensions) and by combining data from different dimensions in a non-linear fashion, the actual dimensionality can efficiently be reduced.

Hashing functions are well researched. They are used to create a characteristic fixed-length bitstring from variable or arbitrarily large input data. This fixed-length bitstring/number is called a pseudokey. Hash functions investigated in the literature focus on creating an apparently random mapping from the input data to the hashed value. Hashing provides incomparably efficient retrieval when all the input data to the hash function is known, since it calculates the location where the requested data can be found. One problem often faced by hashing concerns storage overhead due to data skew. Another, more fundamental, problem is the inability to perform efficient pattern searches, i.e., searches where not all of the input data (for the hash) is known, which end up searching all of the stored data. This is the effect of the normally sought randomness of the hash function.

By using a table-lookup hash function, as in ADB hashes [Riv74], the input and output data can be related in such a way that a search using incomplete input data still limits the amount of data that needs to be searched. The hash function of ADB-hashing "compacts" a number of bits in a bitstring to a smaller number of bits by introducing a "don't care" symbol ("*") in the mapping. It reduces the number of buckets to search when bits are missing in pattern-matching queries by allowing more bits to take part in the indexing schema. There are two drawbacks to the ADB hashes: first, there is added complexity because a table is used to implement the hash functions; second, there is a penalty for searches that use fewer attributes, since more buckets then need to be searched.

We will not explore these methods any further in this thesis, but we note that they closely resemble operations in the field of coding theory. Furthermore, we do not consider here any structures which do not use bits for organizing the search space. For those structures it has been proposed to use mathematical tools such as fractal coding.
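For concreteness, the bit-interleaving that such multi-dimensional hashing schemes use to build a pseudokey from several attributes can be sketched as follows (a generic illustration of ours, not code from any of the cited structures); attributes with fewer bits simply stop contributing once their bits are exhausted.

    #include <stdint.h>

    /* Interleave the bits of k normalized attribute values into one
     * pseudokey, taking one bit per attribute per round, MSB first.      */
    static uint64_t pseudokey(const uint32_t *attr, const unsigned *bits, int k)
    {
        unsigned maxbits = 0;
        for (int a = 0; a < k; a++)
            if (bits[a] > maxbits)
                maxbits = bits[a];

        uint64_t key = 0;
        for (int b = (int)maxbits - 1; b >= 0; b--)     /* MSB down to 0   */
            for (int a = 0; a < k; a++)
                if ((unsigned)b < bits[a])              /* attr has bit b  */
                    key = (key << 1) | ((attr[a] >> b) & 1u);
        return key;
    }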

4.5.2 Dynamic hash-function

Normally, a hash function's output is predictable, since it only depends on the function itself and the input data. Relaxing this constraint, one can build a mapping which is knowledgeable about the data that the function organizes; in effect, the current structure acts as additional input to a newly constructed hash function. The hash function's behavior is updated at times to allow for balancing storage and a growing data set. Typically, the hash function would state how many significant bits each output contains. The hash value produced would vary at different times over the existence of the storage structure, reflecting the movement of the data item inside the structure.

LH [Lit94] has a dynamic hash function. In the case of LH, only the amount of data stored (or the "load") controls which hash function is used. DH [Lar78] and hQT* [Kar98], on the other hand, use a different number of bits from a statically generated pseudokey. The number of bits used depends on the local position in the structure and the local load. When a part of the structure is overloaded, more bits are used for organizing the data in this part. In effect, DH generates a tree structure. The structure itself can be said to be contained in the mapping which defines the hash function. In these structures, however, a bit's value is purely determined by the item itself; only the number of bits used varies in different parts of the structure.

In contrast to having the bits purely defined by the input data, the input data and the current state of the structure can be used together to choose how the output should be formed. This allows for larger flexibility in organizing the structure.

For example, a binary tree can be seen as a dynamic hash function that generates bit by bit when traversing the tree. When a leaf node overflows, it is split into two leaves, causing a branching node to be inserted in its place. Such a node essentially poses a decision of right/left, i.e., 0/1. The hash function "grows" with the input data. At each moment an item in the tree has a unique hash value associated with it. When the tree grows by a split of a leaf node, another bit is added to the output of the hash function for a group of items. In practice, such a hash function or mapping is implemented by a tree structure.
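Seen this way, the "hash value" of a record is just the string of branch decisions taken from the root to its bucket, and its length grows as the tree grows. A small sketch on the node layout assumed in Section 4.3.1 (our illustration, not the thesis code):

    /* The dynamic hash of `record` is the concatenation of the branch bits
     * on the path from the root to its bucket; *nbits returns its current
     * length, which grows as the tree grows.                              */
    static uint64_t dynamic_hash(const Node *node, const uint32_t *record,
                                 unsigned *nbits)
    {
        uint64_t h = 0;
        *nbits = 0;
        while (!node->is_leaf) {
            for (int i = 0; i < node->u.branch.nsplits; i++) {
                SplitPoint sp = node->u.branch.split[i];
                h = (h << 1) | ((record[sp.attr] >> sp.bit) & 1u);
                (*nbits)++;
            }
            /* the bits just appended select the branch to descend into    */
            int branch = (int)(h & ((1u << node->u.branch.nsplits) - 1u));
            node = node->u.branch.child[branch];
        }
        return h;
    }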

4.5.3 Explored variants of Ω-trees

The current Ω-tree is essentially a binary tree with a special function for selecting how a split is to be made. During our early exploration of the Ω design space the following methods were considered.

• Ω-random: Randomly choose any non-used bit which is acceptable at bucket split time. This structure is similar to the randomized kd-trees [DECM98].

• Ω-stiff: kd-tree style splitting schema where the "best bit" from a list of cycled attributes is chosen; resembles kd-tries [Ore82].

• Ω-pseudo: "Cleverly" interleave bits of differently sized domains to create a fair and easy split of buckets, by limiting the choices and promoting attributes equally, independent of their domain sizes.

• Ω-marshal: A relaxed schema with the advantages of Ω-stiff, differing in that more significant bits are preferred. If no acceptable bit is found, another attribute is chosen. The attribute skipped at one level has a higher priority to be chosen at further splits in the subsequent subtree.

We summarize the experience with our structures below.

Ω-random achieves well-balanced storage; however, the pattern search is more costly because no guarantees are given for the number of nodes that prune an attribute.

Ω-stiff has some of the advantages of the kd-tree by cycling through the attributes. This creates "barriers" on every level for one attribute that efficiently prune half of the branches in a search using that attribute. The disadvantage is that the predetermined attribute may be of no use for a split, causing internal nodes in the tree to be just "dummies". The tree generated is highly unbalanced; however, it achieves slightly better pruning, and thus more efficient search, for well-behaved data distributions.

Ω-pseudo, being a combination of Ω-random and Ω-stiff, tries to be more relaxed, avoiding the problem of the unbalanced trees of Ω-stiff by allowing a certain amount of randomness at each level. It is designed to mimic the bit-interleaving of the multi-dimensional hashing structures, while being more flexible. The drawback is that highly random attributes are favored above others, and it does not achieve as efficient pruning as the structures with "barriers" (kd-tree, Ω-stiff).

Ω-marshal, as described in this chapter, stores data efficiently, as well as allowing for stable, efficient retrieval compared to the kd-trees.

4.5.4 Implementation Notes

The language C was used for the implementation of the Ω-structures. Optimizations were made to avoid many function calls by allowing the compiler to inline functions. The search function operates by performing a recursive call at each level.

Since the depth of a large tree can be 10-20 levels, the cost of the recursive calls was of concern. We investigated an iterative solution by means of our own stack. It turned out that this was not an important issue; the overhead of building your own stack and iterating using it led to no substantial improvement in speed, only more complicated code.

The most important gains in speed were made by streamlining bulk operations, such as scanning, splitting and copying data from a bucket to a new bucket. These operations are performed regularly, so substantial time is spent in them when inserting and searching the data.

4.6 Conclusions

We have presented the Ω-storage structure, a self-organizing multi-attribute indexed storage. The performance has been assessed using generated data from the drill-down benchmark [BRK98]. Compared to the kd-tree, the Ω-storage method provides highly stable performance for single-record searches over GBytes of data, while avoiding the highly skewed structures easily created by the kd-trees. This is realized by relaxing the constraints, while maintaining the intended properties of kd-trees, such as highly efficient pruning.

Future work, currently being investigated, involves creating a scalable distributed extension of the Ω-storage. For this we plan to use the dissection splitting algorithm of hQT* [Kar98], which was previously shown to provide excellent partitioning for the distributed quad-tree structure.

Part II

Applications of SDDSs


In the following chapters, an overview of different database systems is given, and we derive a set of "rules of thumb" for what to do and what not to do, lessons learned from other database systems. From that, I then go further into the choices made for Monet and how Monet was enriched with an SDDS, the LH*.

Chapter 5

Database Systems

In this chapter we introduce the reader to various types of database systems (DBMSs); we then provide examples of the need for high-performance databases. The main architecture types of databases are explained. The important issue of scalability is then identified. We discuss the implied need for scalable data structures, scalability both from an accessing and a processing point of view as well as for updates. In most practical cases, as will be seen, parallelism or distribution is mostly used as a means for implementing high performance. The problem still remaining is that of scalability, i.e., the ability to grow/resize the application and the database to any unforeseen size.

5.1 The Need for High Performance Databases

Databases, do we need them? During the 90s, database management systems have been extended to interface with and access external data, legacy data and unstructured data (the web). Many common user applications, e.g. Microsoft products, now permit databases, such as MS SQL, to access external data sources, for example email files, diverse database formats, spreadsheet data, file systems, and so on. However, there is another trend, using a different approach, at present mostly in the database research community, which is to specialize the DBMS to a specific application. Examples of this can be found in the area of Engineering Databases or Scientific Databases [FJP90], where a large amount of data is handled not by the application, but by a database engine. For efficiency, the database should be extensible with operations from the application domain and appropriate new types of indices. Computer Aided Engineering (CAE) systems are applications of interest for merging with databases, since they require advanced modeling capabilities as well as advanced queries; this is explored in the FEAMOS research prototype [Ors96]. It allows matrices to be used in the query language, and

equations can be solved by stating a declarative SQL-like query. Indications show that application programs become more efficient and more flexible, and that they are easier to build. A popular way to implement this merge is to embed the DBMS (code) into the application as a library. This lets the application directly traverse and use the data using a so-called fast-path interface. The database system can then also be extended to use, index, and query application data. DataBlades [SM96] is the Illustra concept of packaging a collection of data types together with access methods, and related functions and operators, into a module. Most other DBMS companies now develop similar concepts under different names, but the idea is the same: modular, extensible database systems that can handle new types of data.

The trend to use databases for more technical purposes also draws interest from the telecommunication industry [Dou90]. High-performance reliable DBMSs, such as Clustra [Tor95], receive increasing attention. The reliable DBMSs allow down times of less than a few minutes a year. There are estimates of the needed rates of insertions, updates and also queries, and the number of such events is in the range of 10,000 per second. This high performance in combination with high reliability is not possible with available standard commercial database systems. TimesTen [Tim] uses main-memory database technology to achieve high transaction rates. Currently (1999) they can perform tens of thousands of transactions per second.

New application areas for high-performance database systems in the telecom industry are directory management, charging of calls, email databases, and multimedia repositories. Internet/Web-based application servers such as search engines and e-commerce web sites benefit from using database systems. Many of these application databases are potentially not only huge in comparison with normal databases, but also require high throughput rates.

These services are more and more often serviced by (database) systems, instead of specialized applications. They have to be able to handle increasingly larger data sets and data retrieval loads. One problem is that many database systems are static in their nature: the ability to give the same per-transaction performance when the amount of data doubles and/or the number of transactions doubles cannot be achieved. They do not scale to unforeseen workloads and sizes. Optimally, a system would be able to deliver the same observed performance when the demands double if the system is given double the resources.

Using current technology, only distributed and parallel (database) systems can cope with high storage demands. We now give an overview of the principles of such database systems.

5.2 Conventional Databases

Single-user databases are widely available and can be used on off-the-shelf hardware, such as single workstations or PCs.

Central databases are the most common type of databases for multi-user environments. They run on single (mainframe) computers. Banking databases, travel agency booking systems and corporate billing databases are examples of central databases. Several users can access the database using either old-style terminals or client/server software. SQL is the most widely used and standardized query language; OQL (Object Query Language) and QBE (Query By Example) are other languages in use.

At the end of the 1990s it has become common to have distributed database systems, often because of company mergers. In the panel debate at VLDB'1999 [Day99] a representative from Oracle mentioned that the integration problems were so large and caused so much extra work that they were going back to more centralized solutions themselves, reducing the number of local databases in various formats and avoiding multi-database problems.

5.3 Distributed Databases

A Distributed Database System (DDBS) can be defined as "a collection of multiple, logically interrelated databases distributed over a computer network" [OV91]. Further on, they also define a Distributed Database Management System (DDBMS) as "the software system that permits the management of the DDBS and makes the distribution transparent to the users." The important terms here are "logically interrelated", "distributed over a computer network" and "transparent". They then give examples of what is not a DDBS:

• a networked node where the whole database resides

• a collection of files

• a multiprocessor system

• a shared-nothing multiprocessor system

• a symmetrical multiprocessor system with identical processors and memory components

• a system where the OS is shared.

A DDBS is often heterogeneous with respect to hardware and operating systems. The data is physically stored at different sites in component databases, and the DDBMS is the integration of these data into one virtual "database". However, the same capabilities and software are usually part of the individual component databases. Transparency is the most important feature of a DDBMS. The main difference compared to multi-database systems, which we will discuss later, is that a distributed database distributes the data transparently over a number of nodes, where each node uses the same DB software to manage its local data and where the nodes are coordinated through the DDBMS. Queries can then be executed jointly and coordinated to provide efficient execution.

5.4 Federated Databases

A distributed database which allows a mix of vendors for the individual DBMS products, as well as different storage schemas, is called a federated database. Data may be split and stored on different database systems. Using a gateway, two DBMSs can access each other's data.

The PEER system [ATWH94] is a federated object management system prototype that is "intended to represent and support complex data interrelation and information exchange in multi-agent industrial automation applications" [ATWH94]. Data consistency is aided by coexisting schemas at each site. A schema describes the part of the information that a site has access to. Information is automatically administered to interrelate schemas and their derivations.

The Telegraph [HCS99] project proposes to focus more on storage and database management. The ideas involve a new storage manager which allows interfacing through numerous methods, such as querying through databases, normal file system access, HTTP (web) access, and distributed persistent data structures for scalable internet services. They say that it should be based on a shared-nothing dataflow architecture that can balance load among the nodes. It will have an adaptive data layout system that automatically handles fragmentation, replication, and migration of data within large clusters of disks. Like Mariposa [SAP+96], it will be based on an economic federation system. Finally, the system will integrate methods for interactive visualization of query results, allowing for data discovery, browsing and mining. Nice goals; however, it means integrating all the goodies from most of their previously fruitful projects and making them work together in one complex system.

5.5 Multidatabases

By contrast, a Multidatabase System (MDBS) is built up from a number of autonomous DBMSs. Most problems in the area of DDBMSs have their counterparts in multidatabase systems, too. However, the design is bottom-up: individual databases already exist, and they have to be integrated to form one schema. This involves translations between the different databases' capabilities during query processing and data exchange. A multidatabase system has to cope with different variants of query languages, and perform all moves of data itself. It acts as a layer of software in between the databases and the user, and the databases do not communicate with each other. Since multi-database updates are a problem, they are usually not allowed online. Instead, they are executed locally, or in batch mode.

5.6 Data Servers

Another trend is to use the availability of powerful workstations and parallel computers for managing internal data in a DBMS. Such a computer, dedicated to this purpose, is called a Data Server. An example of this approach is shown in Figure 5.1, where several terminals are connected to an Application Server that handles user input and data display, parses the query and calls upon the Data Server to execute it. The database itself is stored on a secondary storage medium (disk). Data servers also seem to be becoming popular as storage sites of distributed databases [OV91].

One typical example is SAP's 3-Tier Architecture [Mun99]. It employs one single database server for rudimentary storage. To handle large loads they moved much of the database work into application servers. The application servers have over time been optimized using common database technology; they feature an interpreter, monitor, locking, and table indexing and caching. The presentation layer uses front-ends that process the actual requests from the users. Their system can use an SMP machine with more than 60 CPUs and database sizes of more than 700 GB. However, in effect they do not benefit from new database features, since they have retreated to using the lowest common features of the database systems. Scalability is achieved using faster machines, more main memory and more application and presentation servers.

By dedicating the computer as a data server, it is easier to tune the memory management algorithms. Usually the database system has more knowledge than the operating system as to how and when it uses what data. In the 1970s the idea appeared of dividing the database management system into two parts, a host computer part and a back-end computer [CRDHW74]. Later the terms application server and data server, respectively, were used. Figure 5.1 shows the main idea.

5.7 Parallel Data Servers

Parallel computers are nowadays becoming more and more widespread. During the 1990s PC file servers commonly used 2 or more processors, and multiprocessor hardware is becoming affordable even for single workstations. For such hardware, DDBS technology is used in implementing parallel data servers.

[Figure 5.1 shows terminals connected to an Application Server (User Interface, Query Parsing, Data Server Interface), which communicates with a Data Server (Application Server Interface, Database Functions) backed by the Database.]

Figure 5.1: Data and application servers.

A parallel data server is essentially implemented on a parallel computer and makes extensive use of the advantages in data management that parallelism can then provide. Often, support for distributed databases is part of the implementation. The data managed is automatically fragmented or de-clustered, making the system self-balancing. The work on parallel data servers is related to the work on Database Machines, which will be discussed in the next section. However, since special parallel hardware computers are expensive and current technology is advancing fast, the trend is to use a number of networked mainstream machines for implementing the parallel data servers [?]. Ronström [Ron98] used a special fast interconnect network in building clusters of machines (network multicomputers). The network used was the open standard Scalable Coherent Interface, SCI [IEE92], which has received considerable attention during the 1990s.

5.8 Database Machines

Related to the work on parallel data servers is the earlier work done in the framework of Database Machines. Below we explain the term and present a short overview of some selected systems. In the next chapter we then go further into the details of how large amounts of data are managed in very large systems.

The first mention of a Database Machine was in [CRDHW74]; later the terms Database Computer, or Data Server, have been used for a DBMS-dedicated machine. A dedicated machine has become a natural choice in a distributed environment [OV91]. In such a machine there is no operating system in the ordinary sense. Hence, the DBMS has specially tailored operating system services; minimally this means just dedicated device drivers and a monitor. This is in contrast to a more typical DBMS environment on a general-purpose computer with some operating system. The reason for having a dedicated machine with more specialized software and hardware is to overcome the I/O limitations [BD83] of the von Neumann computer architecture and other restrictions. Another reason is to be able to use technology that is not yet available off-the-shelf. One way to overcome I/O limitations is to keep the whole database in stable main memory [LR85]; alternatively, I/O bandwidth can be increased by using parallel I/O [Du84]. Multiprocessor computers have been studied for performance and data availability.

There are mainly two types of parallel computer architectures. The Shared-Everything type of computer provides high performance but is not scalable to larger sizes. All the nodes share memory and disks, and all other resources are typically accessed via shared buses. Examples include the Sequent computers, the Sun SPARC/Center machine, and the SGI Origin 2000.

It is widely known that this architecture limits the size of an efficient system to around 32 processors. However, it is relatively easy to program. The alternative, the Shared-Nothing computer type, requires extensive programming to share any information and to perform any kind of work jointly using the available resources. Often new algorithms have to be engineered, and much research is concerned with finding algorithms to use the power of the shared-nothing computers. The benefit is that, if one succeeds in programming the shared-nothing computer in a scalable way, the application can scale to a large number of processors.

5.9 Overview of Some Data Servers

In Parallel Database Systems: The Future of High Performance Database Systems [DG92] there is an overview of state-of-the-art commercial parallel systems. Teradata is a shared-nothing parallel SQL system that shows near-linear speed-up and scale-up to a hundred processors. The system acts as a server back-end, and the front-end application programs run on conventional computers. The Tandem NonStop SQL system, currently (1999) owned by Compaq, uses processor clusters running both server and application software on the same operating system and processors. The Gamma system, too, shows near-linear speed-up and scale-up for queries; it runs on Intel's iPSC/2 Hypercube with a disk connected to each node. An implementation of Oracle runs on a 64-node nCUBE shared-nothing machine, with good price-performance measures; it was also the first system to provide 1000 transactions per second.

Examples of shared-nothing databases are Bubba [BAC+90], Teradata DBC/1012 [Cor88], Gamma [DGG+86] and the Tandem NonStop SQL [Tan87]. Examples of shared-memory database systems are XPRS [SKPO88], and the Sequent machine.

Bubba [BAC+90] started out in 1984. The aim was to design a scalable, high-performance and highly available database system that would cost less per performance unit than the mainframes of the 1990s. At the beginning the Bubba project was mostly concerned with parallelizing the intermediate language, FAD. FAD was used for LDL [CGK+90] compilation. The FAD language has complex objects, OIDs, set- and tuple-oriented data manipulators and control primitives. Both transient and permanent data are manipulated the same way. The FAD program was translated into the Parallel FAD language extension, in combination with the Bubba Operating System they built. In the project, data placement, process and dataflow control, interconnection topology, schema design, locking, safe RAM and recovery were studied. They later regretted including all these features and functions, which limited the complete study of the complex system. In their first prototype they learned quite a few "lessons", as they say. Parallelism, for example, gives rise to extra costs in terms of processes, messages and delays. They found that dataflow control was another important issue. They redesigned the language and the implementation. Another identified problem was their usage of three different storage formats for objects (disk, memory and message). Their use of the C++ environment did not make things easier, so in their second system they used the C language. The second, and perhaps more realistic, prototype was rewritten from scratch. There were several reasons for this, including C++, new programmers, and serious robustness problems. Since this was not to be a commercial system, not all important features of the Bubba system were implemented. In the new system only one type of object representation was used, and it was the same for disk, memory and messages. For the Bubba Operating System, AT&T UNIX was used, with some extensions.

Their conclusions at the end of their final Bubba prototype are that "shared-nothing is a good idea (but has limitations)"; "dataflow seems better than remote procedure calls (RPCs) for a shared-nothing architecture"; "more compilation and less run-time interpretation"; "uniform object management"; and that for fault tolerance it is better to replace a failing node than to try to make the nodes fault tolerant. Apart from this, they mention that there was some trouble finding a commercially available hardware platform for their work. Even though the hardware and software were bought, there were both software (operating system) and hardware bugs, but eventually the system functioned properly.

Another system was the PRISMA/DB system [AvdBF+92], which was a parallel, main-memory relational DBMS. It was built from scratch using hardware easily available at the time. It was built on the POOMA shared-nothing machine. Each of the 100 nodes (68020) had 16 Mbytes of memory, and they were interconnected in a configurable way. The Parallel Object-Oriented Language (POOL-X) was developed, featuring processes, dynamic objects, and synchronous as well as asynchronous communication. Some of its specialties were that it could create tuple types on the fly and that conditions on them could be compiled into routines. This helped to speed up scanning, selections and joins. The project was not as successful as one might expect from a main-memory system. It was not really a magnitude better than disk-based DBMSs. The problems included the facts that the hardware did not run at full speed, that the hardware was outdated when the project was evaluated, and that the compiler of the experimental programming language was not fully optimized. But among the positive results they found that by using their language they managed to build a fully functional DBMS from scratch, and the project could then be finished on time.

The XPRS (eXtended Postgres on RAID and Sprite) DBMS [SKPO88] was aimed at high availability and high performance for complex ad-hoc queries in applications with large objects. It was optimized for either a single-CPU system or a shared-memory multiprocessor system. The aim was to show that a general-purpose operating system can also provide high transaction rates and that custom low-level operating systems are not a necessity.

They were much concerned with removing hot spots in data accesses. This was done by reducing the time locks are held, using a new locking schema, and by running the DBMS commands within a transaction in parallel. A fast-path schema was proposed to achieve high performance, as opposed to the common method of stripping out high-level functionality (such as query optimization and views) from the DBMS. For better performance when doing I/O on large objects they built a two-dimensional file system. The resulting reduction of the mean time to failure (MTTF) is then cured by using RAID [PGK88] or striping techniques that provide fault tolerance. These techniques keep a parity block for N disk blocks (on different disks). This block can be used to reconstruct any of the N + 1 blocks from the N other blocks. Thereby the overhead is reduced to 1/N.

DBS3 [BCV91] relied on the assumptions that the success of RISC lies in simplicity and high-performance compilers, that large main memories will be available, and that one should rely on advanced OSs. The first implies that a good optimizer and simple basic DB units are more important than having a complex design and a complex language to program it. Furthermore, the whole database can be entirely stored (cached) in main memory. This simplifies optimization and cache management. Lastly, portability is now important, and most newer OSs include better means for memory management (cache tuning, virtual memory, mapped I/O) and transactions (threads). Permanent data is stored apart from temporary data. This two-level storage divides data so that permanent data is stored on disk and temporary data is managed in (virtual) main memory. It can make use of fragmented relations, both temporary and permanent. Zero or more indices can be used for each fragment. The transaction processing is aimed at online transactions and decision-support queries. Using their parallel execution model, they finally achieved good intra-query parallelism, using pipelining and declustering.

As part of the Sequoia 2000 Project at the University of California, the Mariposa project [SAP+96] proposes a micro-economic model for Wide Area Distributed Database Systems. Their system uses terms from the market economy: sellers, bidders, brokers, budget, purchase, advertisements, yellow pages, coupons, bulk contracts and the term "greedy". The idea is to set up an economy and means for trade and then let the "invisible hand" guide the actual trading of resources.

Worth mentioning are also the commercial systems, which include DB2 and Microsoft SQL Server.

5.10 DB History

In the beginning of the 1980s designing special hardware was very popular; nowadays one tries to use ordinary high-performance workstations. New extensions in the area of memory management are beginning to emerge in the operating systems (UNIX, Windows NT) that will allow more and better control over the system's resources. And instead of parallel computers, the networked parallel computer is becoming more widespread. Special interconnects, such as SCI [IEE92], set the standard in the 90s for high-performance network computers.

During the end of the 1990s, "hot areas" for database systems include the usage of database systems for storing and generating web pages, as in search engines. These systems more than ever require scaling for growing amounts of data and user queries. Especially interesting are database systems that keep (most) data in main memory, a trend allowing for much faster access and higher throughput than disk-based database systems. Research prototypes have existed for many years; these have recently started to be commercialized, for example by the TimesTen [Tim] database system.

5.11 Conclusions

The need for "big" database servers will always be here; distributed cooperative solutions arise, but nevertheless local solutions will dominate. My impressions of the experience gained from the projects mentioned above can be summarized as follows:

• In a shared-nothing architecture dataflow is considered a better paradigm than RPC for query processing.

• The system should be self-managing and self-balancing.

• Fault tolerance is provided by replacement of a faulty node rather than making a node fault tolerant. Replication allows fast switch-over.

• Do not build your own hardware; it will become outdated before the project is finished.

• New hardware and operating systems are error-prone; move to a stable platform.

• Do not write your own experimental implementation language.

• Do not write a compiler; you will not be able to optimize it completely.

• Shared-memory is easier¹ to program than shared-nothing; it does, however, not scale to more than a certain number of nodes, currently substantially less than thousands. Newer parallel computers are likely to have shared-memory (everything) processor nodes connected into a shared-nothing multi-computer.

¹My honored Licentiate opponent Øystein Torbjørnsen had a somewhat different view on the matter. He meant that the only way to master/simplify/make efficient implementations on a shared-memory machine is to program it as a shared-nothing machine. Erlang is a programming language that promotes the same idea, by not apparently sharing data; the system can, however, share data in its implementation for efficiency.

• Use the same format for all storage of the same data: on disk, in memory, in buffers, in messages.

• Use fast-path access to data instead of stripping high-level functionality from the DBMS.

• Parallel I/O systems give high performance.

• Do not implement all features and functions in all possible variants.

5.12 Properties of Data Structures for Parallel Data Servers

In this section we give an introduction to the required properties of data structures used in distributed data applications, such as Parallel/Distributed Data Servers.

5.12.1 The Problem

Modern systems manage high volumes of data, and if they implement data access paths (indices) at all, they are often hard-coded with the application's data. Data is indexed through some key identifying the data; this can efficiently be implemented by using hashing algorithms or some tree structure that keeps the data sorted. It is now well known that most systems mainly use variants of Linear Hashing [Lit80] or B(+)-trees [BM72] for their access paths. For example, DB2 only supports the B-tree style index. IBM is reluctant to add other indices into the DB core of DB2; instead, new indices are implemented by mapping their structure down to B-trees [?]. Other examples of indices include R-trees (spatial), AVL-trees (main-memory sorted indices) and Spiral Storage.

5.12.2 Scalability

In DBMSs the need for scalable data structures is more obvious than for specialized programs. Whatever arbitrary upper limit is set on the amount of data a data structure or a DBMS can handle, it will probably be exceeded at some future time. A scalable data structure can be characterized by the following:

• Insert and retrieval time is independent of the number of stored elements (i.e., it is more or less constant).

• It can handle any amount of data; there is no theoretical upper limit that degrades the performance.

• Furthermore, it is desirable that it grows and shrinks gracefully, not having to reorganize itself totally (as some hashing structures do, rehashing all stored elements), but rather reorganizing itself incrementally during normal processing.

Linear Hashing [Lit80] is an example of a scalable data structure; it is an algorithm for managing random-access data that can dynamically grow or shrink in size. It is based on ordinary hashing schemes and therefore has the advantage of direct access, but not the limitations of a fixed array of buckets. The array is allowed to grow when the data structure reaches a certain saturation limit, or shrink when it falls below some limit. This is achieved through splitting and merging of individual buckets. Other variants include Spiral Storage and Extensible Hashing. The access cost for hashing structures is approximately constant.

B-trees [BM72] are another example. This algorithm maintains a set of ordered data. The data is stored in leaf nodes that are allowed to store between a minimum and a maximum number of elements. When the number of elements exceeds the maximum or falls below the minimum, the leaf is either split or merged with other leaves. The index is maintained in the tree nodes using a similar principle, providing in the end not linear but logarithmic access cost. Variants of scalable ordered data structures include AVL-trees, 2-3-trees, and others.
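As an illustration of why Linear Hashing needs no total rehash, the following minimal sketch shows the usual addressing rule, assuming a file at level i with a split pointer n (the function and parameter names are ours, not taken from any particular implementation):

    #include <stdint.h>

    /* h_i(key) = key mod 2^i, the family of hash functions used by LH [Lit80]. */
    static uint64_t h(uint64_t key, unsigned level)
    {
        return key % (1ULL << level);
    }

    /* Address a key in a Linear Hashing file at level i with split pointer n.
     * Buckets below the split pointer have already been split and must be
     * addressed with the next hash function h_{i+1}; all other buckets keep
     * their old address, so growth only touches one bucket at a time. */
    uint64_t lh_bucket(uint64_t key, unsigned i, uint64_t n)
    {
        uint64_t a = h(key, i);
        if (a < n)
            a = h(key, i + 1);
        return a;
    }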

5.12.3 Distribution

Sometimes the amount of data is larger than what can efficiently be managed or used by a single workstation. Even if this amount of data could be connected physically to one workstation, the processing capabilities of the workstation would not be enough for searching and processing the data. Then, instead, one can employ a distributed data structure that distributes the data over a number of nodes, i.e., workstations. Such a data structure can then be used to keep very large amounts of data online. One way to do this is to apply a hash function to the keys that partitions the data into fragments. The simplest idea is that each fragment stores the data of one bucket; one or several fragments are then stored on each node. Some other schemes require a central directory that is visited before each retrieval or insertion of a data item to get the address of the node storing it. This solves the problem of finding where the data resides if some of it has been moved (because of reorganization). However, the directory can easily become a hot spot when many clients are accessing it. Solutions using hierarchies of distributed directories can then be used; they can cache results of earlier requests to improve performance. This is a scheme similar to the well-known Internet DNS service [Ste94]. Another, simpler distribution strategy is to store one field of a record for all records in a file entirely on one node and other fields on other nodes. However, if the amount of data grows fast, this is not a scalable solution. We notice that these distributed data structures will also have to be scalable over any number of storage sites. This is a relatively new concept: SDDSs, Scalable Distributed Data Structures.
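To make the contrast concrete, here is a hypothetical sketch of the simplest static hash-partitioning scheme mentioned above (fragment = key mod number of nodes). The small demo is ours; it only illustrates that changing the number of nodes remaps most keys, which is exactly the total reorganization an SDDS avoids.

    #include <stdio.h>
    #include <stdint.h>

    /* Static hash partitioning: every client must agree on num_nodes. */
    static unsigned fragment_of(uint64_t key, unsigned num_nodes)
    {
        return (unsigned)(key % num_nodes);
    }

    int main(void)
    {
        unsigned moved = 0, total = 100000;
        for (uint64_t k = 0; k < total; k++)
            if (fragment_of(k, 8) != fragment_of(k, 9))
                moved++;
        /* Roughly 8/9 of the keys change node: almost everything migrates. */
        printf("%u of %u keys move when growing from 8 to 9 nodes\n",
               moved, total);
        return 0;
    }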

5.12.4 Availability

Sometimes distribution is used in combination with some redundancy to achieve high availability. High availability is necessary, for example, in banking and telecom [Tor95] applications, but also in other areas with online transactions, or where the information is of such importance that the extra down time of reading backups cannot be allowed. Using a high-availability scheme, disk crashes as well as some other sources of read or write errors can then be recovered from. The classical variant here is RAID, Redundant Arrays of Inexpensive Disks [PGK88], where a number of disks are connected to one or several computers. One of the disks is used for storing a parity page; this page is calculated by xor-ing a disk page from each disk and storing the result on the parity disk. Each time a write is performed, this page is updated. If one of the disks fails, or a page on one of the disks fails, it can be recovered (reconstructed) from the other disks. Using more disks for parity can ensure recovery from more "errors" and thus gives higher availability.
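The parity idea just described amounts to a bitwise XOR over one page per disk. A minimal sketch, with page size and disk count chosen arbitrarily for illustration:

    #include <stddef.h>
    #include <string.h>

    #define PAGE_SIZE 4096
    #define NDISKS    4     /* data disks; one extra disk holds the parity page */

    /* Compute the parity page as the bitwise XOR of one page from each data
     * disk, as in RAID [PGK88]. */
    void compute_parity(unsigned char pages[NDISKS][PAGE_SIZE],
                        unsigned char parity[PAGE_SIZE])
    {
        memset(parity, 0, PAGE_SIZE);
        for (size_t d = 0; d < NDISKS; d++)
            for (size_t i = 0; i < PAGE_SIZE; i++)
                parity[i] ^= pages[d][i];
    }

    /* Reconstruct the page of one failed disk by XOR-ing the parity page
     * with the pages of the surviving disks. */
    void reconstruct(unsigned char pages[NDISKS][PAGE_SIZE],
                     unsigned char parity[PAGE_SIZE],
                     size_t failed, unsigned char out[PAGE_SIZE])
    {
        memcpy(out, parity, PAGE_SIZE);
        for (size_t d = 0; d < NDISKS; d++)
            if (d != failed)
                for (size_t i = 0; i < PAGE_SIZE; i++)
                    out[i] ^= pages[d][i];
    }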

5.12.5 Conclusions

New applications of databases put requirements on the memory management. Among these requirements we identified that high performance requires scalability, distribution and high availability. Data structure access has to hide the implementation of these aspects and make the access transparent. In this chapter we discussed these three (orthogonal) aspects, but not the important joint usage of them. Scalable solutions will by necessity be distributed and will require high availability, and new data structures, such as the SDDSs, are important for achieving this.

Chapter 6

Scalable Distributed Database Storage Manager/Database System

This chapter essentially contains material from the paper "Transparent Distribution in a Storage Manager" [KK98] presented at PDPTA '98 in Las Vegas. We explore one direction of implementation of an SDDS-enabled database system. The system is based on the MONET [BK95] main-memory database system and allows for a seamless integration.

6.1 Seamless SDDS integration in an Extensible DBMS

Abstract

We will now validate the concept of integrating a Scalable Distributed Data Structure (SDDS) into an extensible DBMS, seamlessly allowing for parallel processing in a multicomputer. We show that this merge provides high-performance processing and scalable storage of very large sets of distributed data.

Further on, we show by extending the relational algebra interpreter that access to data, whether it is distributed or locally stored, can be made transparent to the database application (user). This concept potentially allows existing applications of database systems to efficiently process much more data than a single workstation could.

By seamless integration in the extensible database system we have brought laboratory/isolated studies on SDDSs into the realm of viable alternatives for distributed database systems.

We illustrate the performance efficiency by several experiments on a large network of workstations. For several operators we achieve perfect scale-up, i.e., doubling the number of nodes allows double the amount of processing in the same time.

6.2 Introduction

Over the last decades major progress has taken place in efficient data management in a distributed setting. Many commercial DBMSs already provide the hooks to control data distribution and query optimizers to exploit it. However, modern applications, such as GIS and Data Mining, continue to stress the need for better techniques, both in terms of scalability and performance. Especially the limited and rigid schemes deployed for data distribution and the resulting complexity of the query optimizers hinder major breakthroughs.

In this paper we promote deployment of SDDSs at a broader scale. First, the SDDS LH* [LNS93] has been integrated in a full-fledged extensible database system. Second, we demonstrate how the functionality of SDDSs can be exploited for improving performance of certain algebraic operations.

Scalable Distributed Data Structures (SDDSs) [LNS93] [LNS96] in particular address the issue of scalable storage, i.e., the ability to administer any foreseen and unforeseen amount of data distributed over a number of processing nodes. Most studies have focused on algorithms and experiments in an isolated context, i.e., not integrated in a database system.

Their integration solves a problem often encountered in distributed systems, where manual data (re-)distribution is required to handle increasingly larger data sets. SDDS storage scales automatically without any costly reorganization of all data in lock-step mode. It indeed provides transparent fragmentation.

Our target experimentation platform is a network multi-computer, i.e., a collection of workstations and SMP (Symmetric Multi-Processor) computers. It enables the system to grow over time by adding components to meet the increased storage and processing demands.

The outline of our paper is as follows. In Section 6.3 we shortly describe the key properties of Scalable Distributed Data Structures and the implementation platform Monet. Then, in Section 6.4, we describe the implementation and its integration of SDDSs in Monet. The algebraic operators are analyzed in Section 6.4.4. Section 6.5 shows performance measures of our implementation. Finally, in Section 6.6 we conclude our work and give directions for future research.

6.3 Background

In this section we give a short introduction to the key concepts of Scalable Distributed Data Structures and the Monet database system.

6.3.1 Scalable Distributed Data Structures

Scalable Distributed Data Structures (SDDSs) [LNS93] can be classified as a general access-path mechanism; examples are [Dev93] [WBW94] [KW94] [LNS94] [KLR96] [Kar97]. SDDSs allow storage of a very large number of tuples distributed over any number of nodes. The tuples are distributed to different nodes according to their key value and the state of the SDDS. The primary means for retrieval is again the key value. The objective of SDDSs is to minimize the messages needed to locate a tuple anywhere in the system. Typically, a tuple can be accessed with at most two network messages: one message for sending the request for the tuple and one message for returning the data.

SDDSs differ from other distributed schemes in that they allow the number of nodes to increase at very small cost. Many other distributed data structures require a (costly) total reorganization for adding a node, because they employ static distribution schemes. Examples proposed in the literature are round-robin [Cor88], hash-declustering [KTMO84], and range-partitioning [DGG+86]. The ability of SDDSs to scale to several nodes by acquiring one node at a time and gradual reorganization opens up new areas of storage capacity and data access.

For accessing data stored using an SDDS, a client can calculate where a tuple resides using the key. In the case that the client is not fully aware of the number of servers involved in storing the distributed data, the server receiving the request will forward it towards its correct destination. The reason that the client's calculation may lead to addressing errors is that the clients are not actively updated when the SDDS is partly reorganized. The client's knowledge of how the data is distributed is called the client's image. When a server receives a mal-addressed request it will, in addition to forwarding it, send back an Image Adjustment Message to the client. This message improves the image of the client, preventing it from repeating the same mistake. The rationale behind this is that the number of clients may be too large to be efficiently updated continuously; also, not all clients might be interested in the most accurate information at all times. This design decision is aimed at minimizing communication overhead. The incurred cost for updating clients has been shown to be low [LNS96] in laboratory settings and in simulation studies.

In conclusion, the main three features of an SDDS are:

• There is no central directory that clients have to go through for data access. This avoids hot spots.

• Each client accessing the data has an approximate image of how the data is distributed. The image is lazily updated by servers. The updates occur only when clients make addressing errors.

• The servers are responsible for handling all requests from clients, even if the client makes an addressing error. It is also the responsibility of the server to update the client.

LH* [LNS93], which is our choice of implementation, is a distributed variant of Linear Hashing [Lit80]. The LH* scheme allows efficient hash-based retrieval of any tuple based on the tuple key. Insertion requires, on average, one message and retrieval two messages. If the client's image is outdated the request is forwarded and the client is updated by the server.

The prime drawback of the earlier reported studies is that the operations directly supported are limited to individual tuple access. It requires generalization to support a relational algebra, which we address in this paper.
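A minimal sketch of the LH* client side, following one common formulation of the algorithms in [LNS93]: the client keeps a presumed file level and split pointer, addresses with them, and adjusts them when an Image Adjustment Message (IAM) arrives. Names and the exact IAM rule are our rendering, not code from the actual system.

    #include <stdint.h>

    /* The client image: presumed file level i and split pointer n. */
    struct lhstar_image { unsigned i; uint64_t n; };

    static uint64_t h(uint64_t key, unsigned level)
    {
        return key % (1ULL << level);      /* h_i(key) = key mod 2^i */
    }

    /* Address a key using the (possibly outdated) client image; if the image
     * lags behind, the contacted server forwards the request and replies
     * with an IAM. */
    uint64_t lhstar_client_address(uint64_t key, const struct lhstar_image *img)
    {
        uint64_t a = h(key, img->i);
        if (a < img->n)
            a = h(key, img->i + 1);
        return a;
    }

    /* Apply an IAM carrying the level j and address a of the answering
     * server; this prevents the client from repeating the same error. */
    void lhstar_apply_iam(struct lhstar_image *img, unsigned j, uint64_t a)
    {
        if (j > img->i) {
            img->i = j - 1;
            img->n = a + 1;
            if (img->n >= (1ULL << img->i)) {   /* wrap the split pointer */
                img->n = 0;
                img->i++;
            }
        }
    }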

6.3.2 Monet

Monet¹ [BK95][BMK99] provides for next-generation DBMS solutions using today's trends in hardware and operating-system technology. Monet's features include: a decomposed storage model using binary relations only; main-memory algorithms for query processing; and modular extensibility for data structures, indices and methods. High throughput is achieved with extensive use of bulk operators.

Monet is successfully used in Data Mining applications [HKMT95] and GIS [BQK96], and its performance has been demonstrated against several benchmarks, including OO7 [BKK96] and TPC-D [BWK98].

¹http://www.cwi.nl/~monet

6.4 SDDS within an Extensible Database System

Adding SDDS functionality to a DBMS implies that the system can be scaled to larger dimensions, both when it comes to query processing as well as to storage capacity. An extensible DBMS should ease this integration by allowing for the necessary extensions, identified below.

6.4.1 SDDS requirements on a DBMS

A transparent and efficient integration of SDDSs into a DBMS, without complete redevelopment of the DBMS kernel, poses several requirements on its functionality. The DBMS must:

• be extensible at multiple levels, i.e., enable addition of new data types, algorithms and operators. Unfortunately, few (commercial) DBMSs provide sufficient functionality in this area so far.

• provide a general communication package, such that the SDDS implementation does not have to "know" about what data is transported. Such a module can itself be an extension.

• handle native and user-defined types transparently, for example to be able to use native tables as building blocks for the management of the data held within the SDDS.

Ideally, the SDDS can be used in very much the same way as non-distributed data. In a database perspective it means that SDDSs should exhibit the same behavior as operations on native tables. Further on, table operators have been extended with semantically equivalent counterparts that hide the effects of the SDDS data distribution. SDDSs just become another data type, with the normal relational operators defined upon it.

6.4.2 Resource Management

We assume that each SDDS has a logical numbering of its participating nodes, numbered 0..n-1 by its algorithm or a mapping thereof, n being the number of nodes employed. When a node is addressed, we map the logical number to a virtual machine number. This virtual number is then translated to the unique machine number, which in turn can be used to get to the internet address (IP number). This is a convenient way to cluster different SDDSs' data so that they use the same physical distribution, "syncing" on the primary key.

For example, if two relations (SDDSs) are indexed on the same key, then it makes sense to cluster data in such a way that all tuples with the same keys are kept at the same physical node. Also, SDDS load balancing [WBW94] can benefit from a strategy where several logical nodes of the same SDDS are mapped onto the same machine. This information is kept between different loads of the database. Therefore the same clustering and load-balancing effect can be achieved at the next load onto a different set of physical machines.
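A hypothetical sketch of the indirection chain described above (logical SDDS node -> virtual machine -> unique machine -> IP address); the struct and field names are illustrative only:

    /* Per-SDDS resource map.  Two relations indexed on the same key can share
     * the same logical-to-virtual mapping and thus the same physical layout. */
    struct resource_map {
        unsigned     nnodes;              /* logical nodes 0..nnodes-1        */
        unsigned    *logical_to_virtual;  /* logical node  -> virtual machine */
        unsigned    *virtual_to_machine;  /* virtual mach. -> unique machine  */
        const char **machine_ip;          /* unique machine -> "a.b.c.d"      */
    };

    const char *node_address(const struct resource_map *m, unsigned logical_node)
    {
        unsigned v = m->logical_to_virtual[logical_node];
        unsigned machine = m->virtual_to_machine[v];
        return m->machine_ip[machine];
    }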

6.4.3 SDDS Administration

The management of the SDDS requires some administrative information. It is used to access the data stored in the structure. Below we list the information as it was needed in Monet:

• the home location, the virtual machine number where the zeroth (0th) logical server node of the SDDS is kept,

• the mapping from logical nodes to virtual machine numbers,

• the unique identity number of the SDDS,

• the client's current image of the SDDS,

• and finally, the servers store the distributed data of the SDDS.

We introduce two user-defined types into Monet to support SDDSs. First, a client data structure contains the necessary state information; it also acts as a handle to a binary relation containing the mapping from logical node to virtual machine. Second, a server data structure contains the identity of the distributed table, the logical number of the server, and a handle to the table containing the data stored by that node of the SDDS. Details can be found in [KK97].
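A hypothetical C rendering of these two administrative structures, gathering the items listed above; the actual Monet extension stores this state in Monet's own types (see [KK97]), so the fields below are illustrative:

    #include <stdint.h>

    struct sdds_client {
        unsigned  sdds_id;       /* unique identity number of the SDDS         */
        unsigned  home_vm;       /* virtual machine holding logical node 0     */
        unsigned  image_level;   /* client's current image of the SDDS:        */
        uint64_t  image_split;   /*   presumed level and split pointer         */
        unsigned *node_to_vm;    /* mapping logical node -> virtual machine    */
    };

    struct sdds_server {
        unsigned  sdds_id;       /* identity of the distributed table          */
        unsigned  logical_node;  /* logical number of this server              */
        void     *local_table;   /* handle to the table with this node's data  */
    };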

6.4.4 Algebraic Operations

Studies on SDDSs have focused on individual tuple access. In [SAS95] the authors investigate the usage of an SDDS for a hash-join algorithm, PJLH. Their algorithm overcomes previous drawbacks of static hash joins by adapting the number of participating nodes to the workload. The results are then extended to multi-joins. The join sites of PJLH are disjoint from the sites storing the participating join relations. However, in our setting, this is insufficient. Instead, we have added extended relational operators that deal with the SDDS storage layout. For our implementation we merely assume that a number of relations have been chosen to be distributed using an SDDS scheme and, in many cases, the structure is inherited by the result. Querying these relations is transparent.

The operators have been chosen for ease of implementation, because we are primarily interested in the overhead incurred by the use of SDDSs on the database kernel. In the experiments we focus on the select and join operators. They are the key operators needed to support SQL-like query languages. Furthermore, their implementations show characteristics typical for a large group of relational operations. An informal semantic description is shown below, where A, B are binary relations and a, b, c, d are (scalar) data, such as integers or strings:

    operator           definition
    select(A, l, h)    {(a, b) ∈ A | l ≤ a ≤ h}

The select operator is rather straightforward to implement on an SDDS. Data is sent when needed to SDDS nodes for processing together with the SDDS data. The processing takes place in parallel, and results are then sent back to the originator of the operation. Join operators, however, require more implementation effort, especially in the case of a join over two SDDS-based tables. The different settings and ideas used are elaborated below.

In the examples below all SDDSs have been distributed on the key attribute, i.e., SDDS = {(a, b)}, where the distribution is based on the key attribute a.

• select(SDDS, low, high) -> table
The select operator broadcasts² a local select to all sites of the SDDS. Results are returned to the node that initiated the query.

• join(SDDS, table) -> table
The table is broadcast to all nodes of the SDDS, where a local join takes place. The result is then returned to the originating node of the query.

• join(table, SDDS) -> table
The table is distributed on its join attribute onto the same number of nodes as the SDDS. Effectively this turns into a hash join. Results are then returned.

In the algorithms above, a substantial improvement could be achieved for certain further operators if the results were kept distributed. The decision whether to keep the results distributed depends on a number of parameters and, in general, requires a cost-based model.

²Broadcasting to "all" nodes of an SDDS requires special care. The client does not normally know all participating nodes. We use the inherent structure of the splitting pattern to forward broadcasts, with the addition of extra messages that incrementally update the client, in the end leaving the client with enough information to complete the operations.
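The broadcast-and-gather pattern behind select(SDDS, low, high) can be sketched as follows. This is a simplified illustration only: it assumes the originating node already knows every participating node, whereas the real implementation forwards broadcasts along the splitting pattern (see the footnote above), and sdds_node_count(), sdds_send_request() and sdds_collect_result() are hypothetical communication hooks standing in for the DBMS communication module.

    #include <stddef.h>

    struct tuple { long key; long value; };

    extern unsigned sdds_node_count(unsigned sdds_id);
    extern void     sdds_send_request(unsigned sdds_id, unsigned node,
                                      const char *op, long low, long high);
    extern size_t   sdds_collect_result(unsigned sdds_id, unsigned node,
                                        struct tuple *buf, size_t cap);

    size_t sdds_select(unsigned sdds_id, long low, long high,
                       struct tuple *out, size_t cap)
    {
        unsigned nodes = sdds_node_count(sdds_id);
        size_t total = 0;

        /* Phase 1: fan out the local selects; they run in parallel on the nodes. */
        for (unsigned n = 0; n < nodes; n++)
            sdds_send_request(sdds_id, n, "select", low, high);

        /* Phase 2: gather the partial results at the query's originating node. */
        for (unsigned n = 0; n < nodes && total < cap; n++)
            total += sdds_collect_result(sdds_id, n, out + total, cap - total);

        return total;
    }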

6.5 Implementation and Performance Study

The implementation is such that other SDDSs can easily be included. Since the dominant cost is network transport, however, we expect similar performance characteristics. We now report on results obtained by the integration of LH* in Monet.

The experimentation is geared towards uncovering implementation problems and obtaining a first assessment of the overhead incurred in distributed processing under the SDDS. We want to show the following:

• Optimal size of a distributed partition,


• Overhead added by SDDSs,

• Performance scalability.

We use a network multicomputer [Cul94] [Tan95], in our case a number of Silicon Graphics O2s running IRIX 6.3, each having at least 64 MBytes of memory. For communication the office network is used; this network is a mix of ATM switches and Ethernet. Each workstation has a 180 MHz R5000 MIPS CPU. Measures are given in milliseconds (ms). The experiments are run a number of times to decrease the disturbance from other uses of the network and computers; the best results are then kept and used for the graphs.

Loading of a database may be done in several different ways. If there is only one source, it could be segmented and the data could be loaded N-way in parallel. We will not go into further details of different ways of loading data distributedly. We assume that when the queries are run, appropriate starting relations have already been loaded and distributed.

We use two tables, big and t5. big is an SDDS table stored over a number of nodes, whereas t5 is a main-memory table at the front node. The size of the table t5 is fixed at 100 000 entries (800 KBytes). The contents of both tables are pairs of integers (int, int). Values are unique, and data is not stored sorted. During the query processing indices may be created when it benefits operators.

The select operator scans the whole SDDS (LH*) table to find the matching values. In all our experiments we assume only the SDDS to be distributed. Operators that operate on local data and SDDS data, for example a join between a local table and an SDDS table, first get the data sent to the relevant SDDS nodes. These nodes execute the operator in parallel, sending back the results to be combined. Note that the timings in our experiments include the time for collecting the result at a single node.

6.5.1 Optimal Size of a Distributed Partition

A large file, larger than main memory, cannot be searched with high performance if it fully resides on a single node. We investigate, for a number of operators, the behavior for increasingly larger datasets to find the point where their performance degrades into that of a disk-based system. This gives us the maximal size of a partition for a distributed table, under both network access and CPU cost.

In Figure 6.1 we show the time to execute a range selection and joins using an increasingly larger relation. The range selection selects values in the interval 1 to 10 on the SDDS; the figure also shows the two different joins. On the x-axis the file size is shown in MBytes, and on the y-axis the time in ms.

There is a big performance gap between scanning a 48 MBytes table and a 40 MBytes table. This illustrates that the table/file cannot be kept wholly in main memory anymore.

Figure 6.1: Local memory on one node, varying sizes of data.

Joins were made with table t5. The cost is approximately linear up to a file size of 48 MBytes. For larger tables the performance degrades quickly, because the datasets do not fit entirely into main memory. We studied the number of memory faults (the number of pages that need to be swapped in from disk) to more clearly understand the actual performance degradation.

    Bytes    #pagefaults   elapsed [ms]   user [ms]   system [ms]
    8 MB     1             -              -           -
    16 MB    2             -              -           -
    24 MB    2             1344           1240        40
    32 MB    5             1874           1660        80
    40 MB    213           3060           2070        120
    48 MB    11408         66 000         2490        1900
    56 MB    13676         122 000        2930        3060
    64 MB    16037         103 000        3320        2920

Table 6.1: Memory page faults, and elapsed/user/system time for the operation.

As shown in Table 6.1, for the select(1,10) operation the time increases slowly for smaller sets of data, up to 40 MBytes. At 48 MBytes and beyond, the number of page faults clearly corresponds to the size of the data set, which means that all of the data has been swapped in from disk. The table also shows the elapsed time, user CPU time, and system CPU time. The latter two grow linearly with the increase of data, whereas the elapsed time indicates waiting for disk. Our solution, which is presented in the following text, is to distribute the data using an SDDS, LH*, and we show that we can query even larger sets of distributed data within the same time.
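The page-fault and CPU-time columns of Table 6.1 can be collected on a UNIX system by sampling the process resource usage around the operation. A minimal sketch (not the measurement code actually used in the experiments; the helper report_usage() and the idea of passing the operation as a callback are ours):

    #include <stdio.h>
    #include <sys/time.h>
    #include <sys/resource.h>

    static long usec(struct timeval t) { return t.tv_sec * 1000000L + t.tv_usec; }

    /* Run an operation and report major page faults (pages swapped in from
     * disk) together with user and system CPU time. */
    void report_usage(void (*operation)(void))
    {
        struct rusage before, after;

        getrusage(RUSAGE_SELF, &before);
        operation();                    /* e.g. run select(1,10) on the table */
        getrusage(RUSAGE_SELF, &after);

        printf("#pagefaults (major): %ld\n", after.ru_majflt - before.ru_majflt);
        printf("user CPU   [ms]: %ld\n",
               (usec(after.ru_utime) - usec(before.ru_utime)) / 1000);
        printf("system CPU [ms]: %ld\n",
               (usec(after.ru_stime) - usec(before.ru_stime)) / 1000);
    }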

6.5.2 Overhead added by SDDSs

In this experiment we keep the file size below 40 MB to ensure memory residence of the data. Data is stored on one node and queried remotely. Interestingly, the overhead for scanning, select(1,10), is kept reasonably low in the experiments conducted, as shown in Figure 6.2. The overhead of a scan (select) operation on a file stored at a different node is limited to around 300 ms for the different file sizes (8 MB, 16 MB, 32 MB). This includes the time to send back the results. Joins, however, lose in performance directly, because of the amount of data that has to be transferred to the remote node.

Figure 6.2: One node, distributed access, varying sizes.

6.5.3 Performance Scalability

We show the performance by varying two parameters:

• The size (cardinality) of the SDDSs stored.

• The number of server nodes (workstations).

In the first experiment, we keep the size constant at 1M entries, giving a table using 8 MBytes. The number of nodes involved is varied from 1 to 16, consequently.

The experiment confirms that employing more nodes to store the same amount of data is beneficial for scanning, see Figure 6.3. Range value scanning even improves in performance linearly with the number of nodes. Noticeable is that the time directly halves for one of the joins, join(big, t5), when using one more node. Then the cost increases somewhat for both of the joins due to sending t5 to more nodes. However, this cost does not increase sharply, which means that more data can be stored at more nodes without too much overhead in the cost. This holds under the assumption that the temporary data and the partition of the distributed relation fit into the memory available.

Figure 6.3: Constant sized relation of 8 MBytes, varying number of nodes.

The second experiment varies the size from 1M entries to 8M entries, with a database size of 8 MBytes to 64 MBytes. The number of nodes is kept constant at 8. Figure 6.4 illustrates that an increase of data on a fixed number of nodes leads to a modest increase of the querying cost. Scanning is very fast, much faster than using local main memory of a single node, since it is executed in parallel over the SDDS's nodes. Join cost increases slowly, linear to the size of the transported data.

Figure 6.4: Varying sizes, fixed number of nodes (8).

The last experiment keeps the ratio of entries to number of nodes constant. 4M entries are stored at each node. The file size varies from 4M entries to 32M entries, giving 32 MBytes to 256 MBytes and 1 to 8 nodes.

Figure 6.5: Scale-up values; varying number of nodes, each storing 32 MB.

From Figure 6.5 we see that keeping the same amount of data on all nodes keeps the querying cost roughly constant. The higher costs come from the involvement of more nodes. Part of the time increase for joins is explained by the cost of distributing the t5 table to a larger number of nodes. Observe that for join(big, t5) we can query nearly 3 times as much distributed data in about the same time compared to querying main-memory data; this is shown in Figure 6.6. select(5) shows a slight increase from 174 ms on 1 node to 273 ms on 8 nodes. Surprisingly, the time to execute select(1,10) is extraordinarily constant³, around 2000 ms. However, a reasonably stable time is not unlikely in view of the declining numbers in the constant-size experiments (Figure 6.3) and the increasing numbers in the constant-nodes experiments (Figure 6.4).

³There was one machine that dominated the response time, which gave the same best time for different sizes of the table.

6.5.4 Discussion

The overall conclusions from the experiments are, in short:

• A partition storing distributed data using an SDDS in Monet should not exceed approximately 40 MB on our 64 MB machines, i.e., the memory free for user processes. This keeps the performance from degrading from main-memory to disk-based with thrashing.

Figure 6.6: Comparison between main-memory processing and distributed memory.

• The SDDS-related overhead added by our integration in a fully fledged DBMS kernel is low. Scanning when employing an increasing number of nodes excels over the non-distributed case with perfect scalability.

• For a fixed-size file, joining shows a moderate increase in cost when a larger number of nodes is used.

• When using a variety of different file sizes, for a fixed number of nodes, the costs are higher for larger files. However, the cost increases much more slowly than linearly. For example, comparing joins on 1M entries (8 MBytes) and joins on 4M entries (32 MBytes), the cost increases by only 33% to 46% for our different joins.

• A file can easily be scaled, i.e., a larger number of nodes is used for a larger amount of data, keeping a constant load on each node. Querying of data is then done in roughly constant time, independent of the amount of data and the number of nodes.

6.6 Summary

The prime novelty we have shown in this chapter, by analysis and implementation, is that Scalable Distributed Data Structures provide a viable alternative to conventional data distribution schemes based on static hash- and range-fragmentation. LH*, a well-known SDDS, has been integrated with the extensible database system Monet. The key relational operators were made SDDS-aware, such that the query optimizer is relieved from the expensive task of a priori selecting the 'best' data fragmentation and distribution scheme.

We identified the core requirements on an extensible database system in order for an SDDS to be integrated. In our implementation platform, Monet, the integration was facilitated by the already-present standard modular extensibility. The extension module added relational operators that are able to cope with the distributed SDDS-based tables. In the end, it means that distributed data storage can be treated without any textual/syntactical changes to the queries compared to native (local) data.

The performance experiments demonstrate that the overhead incurred by the SDDS itself is minimal. The bulk of the processing cost stems from moving large fragments of data around. However, in most realistic cases of distribution, querying distributed memory is faster than accessing disk-based data. This makes it possible to use workstation clusters to provide database storage.

This study is currently being expanded towards a run-time optimization technique based on a cost model, and we plan to perform further experiments using queries from TPC-D on an SP/2 platform using our SDDS storage.

Chapter 7

Summary & Future Issues

7.1 Summary

This thesis has focused on developing scalable distributed data structures and indicating their applicability to database systems. We first introduced the area of data structures in general, then focused on LH*, ending up with the concurrently splitting LH*LH structure implemented on the Parsytec machine. To show the viability of SDDSs in combination with database systems, we implemented LH* as a user module in Monet, enabling transparent distribution at the database interaction level. Subsequently we furthered the SDDSs into the area of spatial data, introducing hQT*, a novel scheme that allows "hashing"-performance access to local spatial data, avoiding previously known problems with quad-trees and spatial hashing structures. We showed how the basic structure easily could be SDDS-enabled, using the same basic navigation algorithm as in local navigation. Performance was simulated, measuring the overhead of differently efficient caching schemes. Finally, the Ω-storage was presented, aimed at large main-memory machines. The Ω-storage is attractive for automatically organizing and indexing records, allowing for highly efficient incomplete record matching. Performance was compared to the kd-tree and table scanning. The structure avoids some known problems with the kd-tree, and uses bit-wise navigation for easy organization of the data.

7.2 Extensions to this thesis

An important issue for all distributed data is availability. Ultimately, the importance of availability depends on the application area's requirements. For LH* this has been addressed by creating scalable high-availability schemes [LN97] [LN96a]. Furthermore, a highly distributed scenario is expected to give an increase in the research on "approximate" query optimization, featuring incrementally improving results with an estimation of the confidence [Hel97] [BDF+97]. Transactions spanning several distributed relations involving many sites put strains on available algorithms for transaction and lock management.

7.3 Future work

The work of integrating SDDSs with a database system also requires integration with a query optimizer. We propose to use Live Optimization (briefly outlined in [KK97]) for simplifying the complexity of distributed query optimization. Querying distributed data can in the general case be costly, involving large amounts of data copying and restructuring of data for efficient query (join) processing. Live Optimization avoids the weak point of global optimization, which does not deal with changes of optimization premises, and the problem of correctly estimating intermediate query result sizes [IC91]. Essentially, an executable (sub)plan is selected at runtime using a cost-based dynamic decision. The decision is influenced by several parameters, such as cardinality of intermediate results, sizes of attributes, and distribution criteria. We see opportunities that allow for automatic redistribution and materialization of data in between the operators applied, to excel in performance.

Furthermore, SDDSs are very apt for providing potential solutions to a number of application areas. Consider for example current web servers: these are difficult to maintain, have limited availability, are often manually organized, and are not too good at adapting to varying loads. It is clear that a web server implemented using SDDSs is viable. The Ninja project has similar applications of SDDSs in mind [GBHC00]. Another appealing idea is to unify file storage, using URIs to locate data. It requires both a scalable (and distributed) directory service, as well as storage. On the smaller scale, one corporation could internally provide a self-organizing distributed unified storage that would hide actual details such as mount points/hosts/disks by using an SDDS-extended file system. Trivially, such a system could be implemented using the NFS protocol featuring a local NFS demon, which would calculate the appropriate storage node using an SDDS addressing scheme. If a disk is overloaded, then it could decide to move some of its data to another disk, and the SDDS scheme would ensure efficient locating and retrieval.

It is also an interesting quest to search for "THE Data Structure", which would be scalable in the number of attributes, allow for NULL values, and efficiently dynamically balance skewed access as well as skewed (input) data.

Bibliography

[ATWH94] H. Afsarmanesh, F. Tuijnman, M. Wiedijk, and L. O. Hertzberger. The Implementation Architecture of PEER Federated Object Management System. Technical report, Department of Computer Systems, University of Amsterdam, September 1994. http://carol.wins.uva.nl/net-peer/peer/doc/unicom.ps.

[AvdBF+92] Peter M. G. Apers, Carel A. van den Berg, Jan Flokstra, Paul W. P. J. Grefen, Martin L. Kersten, and Annita N. Wilschut. PRISMA/DB: A Parallel Main Memory Relational DBMS. IEEE Transactions on Knowledge and Data Engineering, 4(1):541-554, February 1992.

[BAC+90] H. Boral, W. Alexander, L. Clay, G. Copeland, S. Danforth, M. Franklin, B. Hart, M. Smith, and P. Valduriez. Prototyping Bubba, A Highly Parallel Database System. IEEE Transactions on Knowledge and Data Engineering, 2(1):4-24, March 1990.

[BCV91] B. Bergsten, M. Couprie, and P. Valduriez. Prototyping DBS3, a Shared-Memory Parallel Database System. In Proceedings of the First International Conference on Parallel and Distributed Information Systems, pages 226-234, Miami Beach, Florida, December 1991.

[BD83] H. Boral and D. DeWitt. Database Machines: An Idea Whose Time Has Passed? A Critique of the Future of Database Machines. In International Workshop on Database Machines, volume 3, pages 166-187, Munich, 1983.

[BDF+97] D. Barbara, W. DuMouchel, C. Faloutsos, P. J. Haas, J. M. Hellerstein, Y. Ioannidis, H. V. Jagadish, T. Johnson, R. Ng, V. Poosala, K. A. Ross, and K. C. Sevcik. The New Jersey data reduction report. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 20(4):25-34, September 1997.

[Ben75] Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509-517, September 1975.

[Ben79] Jon L. Bentley. Multidimensional binary search trees in database applications. IEEE Transactions on Software Engineering, SE-5(5):333-340, July 1979.

[BK95] Peter A. Boncz and Martin L. Kersten. Monet: An Impressionist Sketch of an Advanced Database System. In Basque International Workshop on Information Technology: Data Management Systems, San Sebastian (Spain), July 1995. IEEE.

[BKK96] Peter A. Boncz, F. Kwakkel, and Martin L. Kersten. High Performance Support for OO Traversals in Monet. In British National Conference on Databases (BNCOD'96), 1996.

[BM72] Rudolf Bayer and Edward M. McCreight. Organization and Maintenance of Large Ordered Indices. Acta Informatica, 1:173-189, 1972.

[BMK99] P. A. Boncz, S. Manegold, and M. L. Kersten. Database Architecture Optimized for the New Bottleneck: Memory Access. In Proceedings of the International Conference on Very Large Data Bases (VLDB), Edinburgh, United Kingdom, September 1999. To appear.

[BQK96] Peter A. Boncz, Wilko Quak, and Martin L. Kersten. Monet and its Geographic Extensions: a Novel Approach to High Performance GIS Processing. In Advances in Database Technology - EDBT'96, pages 147-166, Avignon, France, March 1996. Springer.

[BRK98] P. A. Boncz, T. Rühl, and F. Kwakkel. The Drill Down Benchmark. In Proceedings of the International Conference on Very Large Data Bases (VLDB), pages 628-632, New York, NY, August 1998.

[BWK98] Peter A. Boncz, Annita N. Wilschut, and Martin L. Kersten. Flattening an object algebra to provide performance. In IEEE 14th International Conference on Data Engineering, pages 568-577, Orlando, FL, USA, February 1998.

[CGK+90] D. Chimenti, R. Gamboa, R. Krishnamurthy, S. Naqvi, S. Tsur, and C. Zaniolo. The LDL System Prototype. IEEE Transactions on Knowledge and Data Engineering, 2(1):76-89, March 1990.

[Cor88] Teradata Corporation. DBC/1012 data base computer concepts and facilities. Technical Report Teradata Document C02-001-05, Teradata Corporation, 1988.

[CRDHW74] R. H. Canady, J. L. Ryder, R. D. Harrisson, E. L. Ivie, and L. A. Wehr. A back-end computer for data base management. Communications of the ACM, 17(10):572-582, October 1974.

[Cul94] D. Culler. NOW: Towards Everyday Supercomputing on a Network of Workstations. Technical report, EECS Technical Reports, UC Berkeley, 1994.

[Day99] Umeshwar Dayal. Industrial panel on data warehousing technologies: Experiences, challenges, and directions. In Malcolm P. Atkinson, Maria E. Orlowska, Patrick Valduriez, Stanley B. Zdonik, and Michael L. Brodie, editors, VLDB'99, Proceedings of 25th International Conference on Very Large Data Bases, September 7-10, 1999, Edinburgh, Scotland, UK, page 725. Morgan Kaufmann, 1999.

[DECM98] Amalia Duch, Vladimir Estivill-Castro, and Conrado Martinez. Randomized k-dimensional binary search trees. In Lecture Notes in Computer Science, volume 1533, pages 199-208, Taejon, Korea, December 1998. Springer.

[Dev93] R. Devine. Design and implementation of DDH: A distributed dynamic hashing algorithm. In Proceedings of the 4th International Conference on Foundations of Data Organization and Algorithms (FODO), 1993.

[DG92] David DeWitt and Jim Gray. Parallel Database Systems: The Future of High Performance Database Systems. Communications of the ACM, 35(6):85-98, 1992.

[DGG+86] D. DeWitt, R. Gerber, G. Graefe, M. Heytens, K. Kumar, and M. Muralikrishna. GAMMA: A high performance dataflow database machine. In Proceedings of VLDB, August 1986.

[Dou90] B. Dougherty. Telco's Strategic Importance in Tandem's Success. Industry Viewpoint, 1990.

[Du84] H. C. Du. Distributing a database for parallel processing is NP-hard. ACM SIGMOD Record, 14(1):55-60, March 1984.

[FBF77] Jerome H. Friedman, Jon Louis Bentley, and Raphael Ari Finkel. An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software, 3(2):209-226, September 1977.

[FBY92] William B. Frakes and Ricardo Baeza-Yates, editors. Information Retrieval: Data Structures & Algorithms. Prentice Hall, 1992.

[FJP90] J. C. French, A. K. Jones, and J. L. Pfaltz. Summary of the Final Report of the NSF Workshop on Scientific Database Management. In SIGMOD Record, volume 19:4, pages 32-40, December 1990.

[Fre87] Michael Freeston. The BANG File: A New Kind of Grid File. In ACM SIGMOD: Special Interest Group on Management of Data 1987, pages 260-269, San Francisco, California, December 1987.

[FRS93] G. Fahl, T. Risch, and M. Sköld. AMOS - An Architecture for Active Mediators. In IEEE Transactions on Knowledge and Data Engineering, Haifa, Israel, June 1993.

[GBHC00] Steven D. Gribble, Eric A. Brewer, Joseph M. Hellerstein, and David Culler. Scalable, Distributed Data Structures for Internet Service Construction. In OSDI 2000: Fourth Symposium on Operating Systems Design and Implementation, 2000.

[Har94] Evan Philip Harris. Towards Optimal Storage Design for Efficient Query Processing in Relational Database Systems. PhD thesis, tech report 94/31, University of Melbourne, November 1994.

[HCS99] Joe Hellerstein, Mike Carey, and Mike Stonebraker. Telegraph: A universal system for information. WEB, 1999. http://db.cs.berkeley.edu/telegraph/.

[Hel97] J. M. Hellerstein. Online processing redux. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 20(3):20-29, September 1997.

[HKMT95] M. Holsheimer, M. L. Kersten, H. Mannila, and H. Toivonen. A Perspective on Databases and Data Mining. In Knowledge Discovery in Databases (KDD'95), Montreal, Canada, 1995.

[HSW88] Andreas Hutflesz, Hans-Werner Six, and Peter Widmayer. Globally Order Preserving Multidimensional Linear Hashing. In ICDE: Fourth International Conference on Data Engineering, pages 572-579, Los Angeles, California, 1988.

[HSW89] Andreas Henrich, Hans-Werner Six, and Peter Widmayer. The LSD tree: Spatial Access to Multidimensional Point and Nonpoint Objects. In Fifteenth International Conference on Very Large Data Bases, Amsterdam, The Netherlands, August 1989.

[IC91] Yannis E. Ioannidis and S. Christodoulakis. On the propagation of errors in the size of join results. In International Conference on Management of Data. ACM-SIGMOD, June 1991.

[IEE92] IEEE. IEEE Standard for Scalable Coherent Interface (SCI). IEEE, 1992. http://www.SCIzzL.com/.

[Jön99] Henrik André Jönsson. Indexing time-series data using text indexing methods. Licentiate Thesis No. 723, Department of Computer and Information Science, Linköping University, 1999.

[Kar94] Jonas S. Karlsson. An Implementation of Transaction Logging and Recovery in a Main Memory Resident Database System. Master's thesis, Department of Computer and Information Science, Linköping University, 1994.

[Kar97] Jonas S Karlsson. A Scalable Data Structure for a Parallel Data Server. Licentiate Thesis No. 609, Department of Computer and Information Science, Linköping University, 1997.

[Kar98] Jonas S Karlsson. hQT*: A Scalable Distributed Data Structure for High-Performance Spatial Accesses. In Katsumi Tanaka and Shahram Ghandeharizadeh, editors, FODO'98: The 5th International Conference on Foundations of Data Organization, pages 37-46, Kobe, Japan, November 1998.

[Kar00] J. S. Karlsson. Omega-storage: A Self Organizing Multi-attribute Storage Technique for Large Main Memories. In Australasian Database Conference, Canberra, Australia, January 2000. IEEE Computer Society Press. Accepted for publication.

[KK97] Jonas S Karlsson and Martin L. Kersten. Scalable Storage for a DBMS using Transparent Distribution. Technical Report INS-R9710, ISSN 1386-3681, CWI (The Dutch Centre for Mathematics and Computer Science), 1997.

[KK98] Jonas S Karlsson and Martin L. Kersten. Transparent Distribution in a Storage Manager. In The 1998 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'98), Las Vegas, Nevada, USA, July 1998.

[KK99] Jonas S. Karlsson and Martin L. Kersten. An Exploration of the Omega-Storage Design Space. Unpublished Technical Report, CWI, The Netherlands, 1999.

[KLR+94] J. S. Karlsson, S. Larsson, T. Risch, M. Sköld, and M. Werner. AMOS User's Guide. CAELAB, IDA, Department of Computer and Information Science, Linköping University, Sweden, memo 94-01 edition, March 1994. http://www.ida.liu.se/labs/edslab/amos/amosdoc.html.

[KLR96] Jonas S Karlsson, Witold Litwin, and Tore Risch. LH*LH: A Scalable High Performance Data Structure for Switched Multicomputers. In Advances in Database Technology - EDBT'96, pages 573-591, Avignon, France, March 1996. Springer.

[Knu] Donald E. Knuth. The Art of Computer Programming, volume 3, chapter 6.4, pages 513-558. Second edition.

[KS86] H.-P. Kriegel and B. Seeger. Multidimensional Order Preserving Linear Hashing with Partial Expansions. In International Conference on Database Theory, pages 203-220, Rome, 1986.

[KS88] H.-P. Kriegel and B. Seeger. PLOP-Hashing: A Grid File without Directory. In ICDE: Fourth International Conference on Data Engineering, pages 369-376, Los Angeles, California, 1988.

[KTMO84] M. Kitsuregawa, H. Tanaka, and T. Moto-Oka. Architecture and performance of relational algebra machine GRACE. In Proceedings of the Intl. Conference on Parallel Processing, Chicago, 1984.

[KW94] Birgitte Kröll and Peter Widmayer. Distributing a Search Tree Among a Growing Number of Processors. In ACM SIGMOD Conference on the Management of Data, Minneapolis, 1994.

[KW95] Birgitte Kröll and Peter Widmayer. Balanced Distributed Search Trees Do Not Exist. In WADS Conference, Minneapolis, 1995.

[Lar78] P. A. Larson. Dynamic hashing. BIT, 18(2):184-201, 1978.

[Lar88] P. A. Larson. Dynamic hash tables. In Communications of the ACM, volume 31(4), pages 446-457. April 1988.

[LER92] Ted G. Lewis and Hesham El-Rewini. Introduction to Parallel Computing. Number ISBN 0-13-498916-3. Prentice Hall, 1992.

[Lit80] W. Litwin. Linear Hashing: A new tool for file and table addressing. In Proceedings of VLDB, Montreal, Canada, 1980.

[Lit94] W. Litwin. Linear Hashing: A new tool for file and table addressing. In Michael Stonebraker, editor, Readings in DATABASE SYSTEMS, 2nd edition, pages 96-107. 1994.

[LN96a] W. Litwin and M-A. Neimat. High-availability LH* schemes with mirroring. In First IFCIS International Conference on Cooperative Information Systems (CoopIS'96), pages 196-205, Brussels, Belgium, June 1996.

[LN96b] W. Litwin and M-A. Neimat. k-RP*: A Family of High Performance Multi-attribute Scalable Distributed Data Structures. In IEEE International Conference on Parallel and Distributed Systems, PDIS-96, December 1996.

[LN97] W. Litwin and M-A. Neimat. LH*s: A High-Availability and High-Security Scalable Distributed Data Structure. In RIDE'97: Seventh International Workshop on Research Issues in Data Engineering, Birmingham, England, 1997.

[LNS93] W. Litwin, M-A. Neimat, and D. Schneider. LH*: Linear hashing for distributed files. In ACM-SIGMOD International Conference On Management of Data, May 1993.

[LNS94] W. Litwin, M-A. Neimat, and D. Schneider. RP*: A Family of Order Preserving Scalable Distributed Data Structures. In Proceedings of VLDB, 1994.

[LNS96] W. Litwin, M-A. Neimat, and D. Schneider. LH*: A Scalable Distributed Data Structure. ACM-TODS Transactions on Database Systems, December 1996.

[LR85] M. D. P. Leland and W. D. Roome. The silicon database machine. In Proceedings of the 4th International Workshop on Database Machines, pages 169-189, Grand Bahama Island, March 1985.

[LS89] David B. Lomet and Betty Salzberg. hB-tree: A Robust Multi-Attribute Search Structure. In Fifth International Conference on Data Engineering, Los Angeles, California, February 1989.

[MSP93] A. Matrone, P. Schiano, and V. Puotti. LINDA and PVM: A comparison between two environments for parallel programming. Parallel Computing, 19:949-957, 1993.

[Mun99]] Rudolf Munz. Usage Scenarios of DBMS. URL, Septem• berr 1999. http://www.dcs.napier.ac.uk/~vldb99/ Industrial- SpeakerSlides/SAPVLDB.pdf. .

[NHS84]] Jörg Nievergelt, Hans Hinterberger, and Kenneth C. Sevcik. Thee grid file: An adaptable, symmetric multikey file structure. AA CM Transactions on Database Systems (TODS), 9(1):38 71, Marchh 1984.

[Ore82]] Jack A. Orenstein. Multidimensional tries used for associa• tivee searching. Information Processing Letters, 14(4):150 157, Junee 1982.

[Ors96]] Kjell Orsborn. On Extensible And Object-Relational Database TechnologyTechnology for Finite Element Analysis Applications. Disser• tationn No. 452, Department of Computer Science and Infor• mationn Science, Linköping University, 1996.

[OS83]] Yutaka Ohsawa and Masao Sakauchi. The BD-tree a new- dimensionall data structure with highly efficient dynamic char• acteristics.. Information Processing 83, 1983. North-Holland, Amsterdam,, 539-544.

[Oto84]] Ekow J. Otoo. A Mapping Function for the Directory of a Multidimensionall Extendible Hashing. In Tenth International ConferenceConference on Very Large Data Bases, pages 493 506, Singa• pore,, 1984.

[Oto88] Ekow J. Otoo. Linearizing the Directory Growth in Order Preserving Extendible Hashing. In ICDE: Fourth International Conference on Data Engineering, Los Angeles, California, 1988.

[OV91] M. Tamer Ozsu and Patrick Valduriez. Principles of Distributed Database Systems. ISBN 0-13-715681-2. Prentice Hall, 1991.

[Par94] Parsytec Computer GmbH. Programmers Guide, Parix 1.2-PowerPC, 1994.

[Pet93] M. Pettersson. Main-Memory Linear Hashing - Some Enhancements of Larson's Algorithm. Technical Report LiTH-IDA-R-93-04, ISSN 0281-4250, Department of Computer and Information Science, Linköping University, 1993.

[PGK88] David A. Patterson, Garth A. Gibson, and Randy H. Katz. A Case for Redundant Arrays of Inexpensive Disks (RAID). In ACM SIGMOD International Conference on Management of Data, pages 109-116, Chicago, Illinois, USA, June 1988.

[Riv74] Ronald Linn Rivest. Analysis of Associative Retrieval Algorithms. PhD-thesis STAN-CS-74-415, Computer Science Department, Stanford University, May 1974.

[Ron98] Mikael Ronström. Design and Modelling of a Parallel Data Server for Telecom Applications. PhD-thesis 1998:520, Department of Computer and Information Science, Linköping University, 1998.

[Sam89] Hanan Samet. The Design and Analysis of Spatial Data Structures. January 1994 edition, 1989.

[SAP+96] M. Stonebraker, P. M. Aoki, A. Pfeffer, A. Sah, J. Sidell, C. Staelin, and A. Yu. MARIPOSA: A Wide-Area Distributed Database System. VLDB Journal, 5(1):48-63, January 1996. http://epoch.CS.Berkeley.EDU:8000/mariposa/papers/s2k-95-63.ps.

[SAS95] Vineet Singh, Minesh Amin, and Donovan Schneider. An Adaptive, Load Balancing Parallel Join Algorithm. Technical Report HPL-95-46, Hewlett-Packard Labs, 1995.

[SFGM93] M. Stonebraker, J. Frew, K. Gardels, and J. Meredith. The Sequoia 2000 Storage Benchmark. In 19th ACM SIGMOD Conference on the Management of Data, Washington DC, USA, May 1993.

[SKPO88] M. Stonebraker, R. Katz, D. Patterson, and J. Ousterhout. The Design of XPRS. In VLDB Conference, volume 14, pages 318-330, Los Angeles, California, 1988.

[SM96] Michael Stonebraker and Dorothy Moore. Object-Relational DBMSs: The Next Great Wave. ISBN 1-55860-397-2. Morgan Kaufmann Publishers, Inc., San Francisco, California, 1996.

[SPW90] C. Severance, S. Pramanik, and P. Wolberg. Distributed linear hashing and parallel projection in main memory databases. In Proceedings of the 16th International Conference on VLDB, Brisbane, Australia, 1990.

[Ste94] W. Richard Stevens. TCP/IP Illustrated, Volume 1. Addison-Wesley, 1994.

[Tam81] Markku Tamminen. Order Preserving Extendible Hashing and Bucket Tries. BIT, 21(4):419-435, 1981.

[Tan87] Tandem. NonStop SQL - a distributed high-performance, high-availability implementation of SQL. In Proceedings of the International Workshop on High Performance Transaction Systems, pages 337-341, Asilomar, California, September 1987.

[Tan95] Andrew S. Tanenbaum. Distributed Operating Systems. 1995.

[Tim] TimesTen. TimesTen Performance Software. http://www.timesten.com/.

[Tor95] Øystein Torbjørnsen. Multi-Site Declustering Strategies for Very High Database Service Availability. PhD-thesis 1995:16, Department of Computer Systems and Telematics, Faculty of Electrical Engineering and Computer Science, Norwegian Institute of Technology, University of Trondheim, Norway, 1995.

[WBW94] R. Wingralek, Y. Breitbart, and G. Weikum. Distributed file organisation with scalable cost/performance. In ACM-SIGMOD International Conference on Management of Data, May 1994.

[WSB98] Roger Weber, Hans-J. Schek, and Stephen Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proceedings of the 24th VLDB Conference, pages 194-205, New York, USA, 1998.