Chapter 1: Preliminaries

We start out by giving an introduction to SDDSs, followed by a description of LH*, which generated this track of data storage structures. We will motivate their existence and point out some possible application areas; subsequently we describe common (basic) data structures (known to most readers), their behavior and properties from the perspective of SDDSs. This defines the terminology and allows us later to describe design choices more accurately.

1.1 Birthground of SDDSs

In traditional distributed file systems, in implementations like NFS or AFS, a file resides entirely at one specific site. This presents obvious limitations, not only on the size of the file, but also on the scalability of access performance. To overcome these limitations, distribution over multiple sites has been used. One example of such a scheme is round-robin [Cor88], where records of a file are evenly distributed by rotating through the nodes as records are inserted. The hash-declustering method of [KTM084] assigns records to nodes on the basis of a hashing function. The range-partitioning method of [DGG+86] divides key values into ranges and assigns different ranges to different nodes. A common aspect of these schemes is their static behavior, which means that the declustering criterion does not change over time. Hence, updating a directory or declustering function is not required. The price to pay is that the file cannot expand over more sites than initially allocated.

To overcome this limitation of static schemes, dynamic partitioning is used. The first such scheme was DLH [SPW90]. This scheme was designed for a shared-memory system. In DLH, the file is in RAM and the file parameters are cached in the local memory of each processor. The caches are refreshed selectively when addressing errors occur, and through atomic updates to all the local memories at certain points. DLH appears impressively efficient for high insertion rates.


1.2 SDDSs

SDDSs were proposed for distributing files in a network multi-computer environment, hence without shared memory. The first scheme was LH* [LNS93]. Distributed Dynamic Hashing (DDH) [Dev93] is another SDDS, based on Dynamic Hashing [Lar78]. The idea, with respect to LH*, is that DDH allows greater splitting autonomy by immediately splitting overflowing buckets. One drawback is that while LH* limits the number of forwardings to two (in theory, communication delays could trigger more forwarding [WBW94]) when the client makes an addressing error, DDH may use O(log2 N) forwardings, where N is the number of buckets in the DDH file.

[WBW94] extends LH* and DDH to control the load of a file more efficiently. The main idea is to manage several buckets of a file per server, while LH* and DDH have basically only one bucket per server. One also controls the server load, as opposed to the bucket load as in LH*.

Both [KW94] and [LNS94] propose primary-key ordered files. In [KW94] the access computations on the clients and servers use a distributed binary search, whereas the SDDSs in [LNS94], collectively termed RP*, use broadcast or distributed n-ary trees. It is shown that both kinds of SDDSs allow for much larger and faster files than the traditional ones.

1.3 Requirements from SDDSs

SDDSs (Scalable Distributed Data Structures), such as the distributed variant of Linear Hashing, LH* [LNS96], and others [Dev93][WBW94][LNS94], open up new areas of storage capacity and data access. There are three requirements for an SDDS:

First, it should have no central directory to avoid hot-spots.

Second, each client should have some approximate image of how data is distributed. This image should be improved each time a client makes an addressing error.

Third, if the client has an outdated image, it is the responsibility of the SDDS to forward the data to the correct data server and to adapt the client's image.

SDDSs are good for distributed computing since they aim at minimizing communication, which in turn minimizes response time and enables more efficient use of processor time.

In light of LH [Lit80] and LH* [LNS96] the following terms are used. The data sites, termed servers, can be used from any number of autonomous sites, termed clients. To avoid a hot-spot, there is no central directory for the addressing across the current structure of the file. Each client has its own image of this structure. An image can become outdated when the file expands. The client may then send a request to an incorrect server. The servers forward such requests, possibly in several steps, towards the correct address. The correct server appends to the reply a special message to the client, called an Image Adjustment Message (IAM). The client adjusts its image, avoiding a repetition of the error. A well-designed SDDS should make addressing errors occasional and forwards few, and it should provide for scalability of the access performance as the file grows. A typical SDDS scenario has many more clients than servers, with reasonably active clients, i.e., a hundred or more interactions in the lifetime of a client.

1.4 Data Structures — Background

In this section we give a short overview of commonly used data structures for indexing. We start out by introducing desirable features of such data structures. In a distributed scenario using SDDSs these desired properties are of even higher importance. The workings of distributed data structures are not the main topic of this section, so it can be skipped by the expert on data structures.

Data in DBMSs is organized using data structures, also referred to as access structures, access paths, accelerators, indices, or indexing structures. We identify three important properties for a "good" data structure. The first is that application accesses to the individual elements encoded in the data structure should be fast, i.e., insertion or retrieval should be efficient (access overhead). Secondly, the storage overhead, i.e., the extra storage space needed for organizing the data and improving the access speed, should be low. Third, the data structure should be able to handle the amount of data that is needed, i.e., it should be scalable: the structure should dynamically adapt to different storage sizes without deteriorating performance.

1.4.1 Retrieval Methods

Data is structured in records containing fields (attributes), e.g., bank account information. Some fields, called search keys, are the target of retrieval. The search method depends on the inherent search characteristics, which can be classified along the following lines:

Key retrieval (lookup)

Range retrieval (domain specific)

Approximate retrieval (sub-string, soundex)

Predicate search (filtering)

Multi-dimensional searches (point, spatial, space, nearest)

Information retrieval (probabilistic)

Where some applications access data using one key, others may need to retrieve data in a certain range, which leads to range retrieval. Approximate search is another type of retrieval, often user specified, allowing for matching under a similarity measure. A special case of proximity is sub-string search, for example searching for addresses whose names contain the string "city". Soundex search allows one to search for names of people that sound like "John" (Jon, John, Jonny, Jonni, Johnnie, Johnny). Soundex searching is efficiently implemented by mapping the search key to a normalized (sound-invariant) spelling representation.

Other kinds of retrieval might consider several fields simultaneously, often referred to as different attributes or dimensions. Examples of dimensions are spatial (x, y)-coordinates, or, in a common database, Zip-code, Age, and Income. A multi-dimensional indexing structure allows retrieval of data using several of these keys (dimensions) at the same time. A more general case is predicate search, which allows the program to specify an arbitrary predicate which, when invoked on the data, returns a true/false value. If the predicate yields true, the data is returned to the user/application.

Information retrieval sciences are not that strict, and employ a scoring function which scores the data, returning the "best matches" ranked with the best match first. Web search engines, such as http://www.altavista.com/, employ various searching and scoring methods.
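To make the sound-invariant mapping concrete, here is a minimal Python sketch of a classic Soundex-style encoding (the exact rules vary between implementations; this variant and its details are illustrative, not taken from the thesis):

```python
def soundex(name: str) -> str:
    """Map a name to a sound-invariant code (classic Soundex sketch)."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    if not name:
        return ""
    # Keep the first letter, then encode the rest, skipping vowels
    # and collapsing runs of the same digit.
    result = name[0].upper()
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        if ch not in "hw":        # h/w do not break a run in classic Soundex
            prev = digit
    return (result + "000")[:4]   # pad or truncate to four characters

# Spellings that differ only in vowels or doubled consonants collapse
# to the same code, so one lookup finds all of them:
assert soundex("John") == soundex("Jon") == soundex("Johnny")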

1.4.2 Reasonable Properties

A reasonably well-behaved and efficient data structure can be expected to fulfill most of the following statements.

A data structure is a container that stores n items of data.

Each item is identified by its key(s) (algorithms typically assume that the keys are unique, but in practice this is often relaxed).

A single insert of an item should ideally be done in constant time, but typically O(log n) time is acceptable.

Lookups using unique keys are expected to be faster than, or at least exhibit the same cost as, inserts.

Iteration over all items in the data structure takes O(n).

If the data structure supports ordered retrieval, i.e., previous and next operations, these are expected to take O(1) time. From this follows that, starting at the first item and applying the next operation on succeeding items until the last is reached, should take O(n). Localization is then O(n log n) at worst.

Complex querying ("searching") of the data structure, returning r items, optimally takes O(r) time. The worst case, however, is O(n). An effective structure might perform pruning and achieve O(r + log n). The effectiveness of a search can be expressed by the overhead t/r, where t is the number of tests, together with the pruning factor (n - t)/n.

Most data structures are in practice limited by available memory, disk page sizes, etc. If the data structure can dynamically restructure itself and keep up its performance, then we say it is dynamic. Then, theoretically, it has no upper limit on the amount of items it can handle.

However, many dynamic structures deteriorate on skewed data, creating a structure where most of the data is stored in few places, and the insert and retrieval operations deteriorate in performance. The cause may be that the partitioning function used does not perform well or that the input data appears in an inconvenient order. Some structures balance themselves to avoid this deterioration problem.

A non-dynamic structure can be replaced by another non-dynamic structure which can hold more data, avoiding the deterioration/skew experienced. This was the classic way to achieve dynamic structures (rebuilding; typical variants are array-doubling implementations and the rehashing of hash-structures; see the sketch after this list). One problem with this approach is that in some cases it may require the same time for re-inserting all the data stored in the previous structure, as well as double the amount of space for that time. We will refer to structures that gracefully accommodate these problems as being scalable. Examples include Linear Hashing [Lit80], Dynamic Hashing [Lar78], and the B-tree [BM72].
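For concreteness, a minimal Python sketch of the rebuilding approach (the array-doubling variant; names are illustrative). The O(n) re-insertion and the temporarily doubled space discussed above are both visible:

```python
class DynamicArray:
    """Rebuild-style growth: when the fixed array overflows, allocate a
    larger one and copy everything over (briefly using double the space)."""
    def __init__(self):
        self.items = [None] * 4
        self.size = 0

    def append(self, x):
        if self.size == len(self.items):          # overflow: rebuild
            bigger = [None] * (2 * len(self.items))
            for i in range(self.size):            # O(n) copy of all items
                bigger[i] = self.items[i]
            self.items = bigger                   # old array now discarded
        self.items[self.size] = x
        self.size += 1
```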

1.4.3 Basic Data Organization

The simplest and most common storage structure is the array, in most programming languages predetermined in size and in the type of data it stores. The elements of an array are generally stored contiguously in memory and accessed using an implicitly calculated index. Retrieving data is easy when the position is known or when it can be calculated easily. When data can be retrieved by direct look-up in an array structure we call it radix retrieval or direct addressing. When the data is stored in a sorted array, we can use binary search to locate the exact position of the data requested.

Structure/Method    Key     Ordered   Insert     Lookup     Search   Memory
Array               1..n    -         1          1          O(n)     O(n)
Array               1..n    Y         O(n)       n/2        O(n)     O(n)
List                atom    -         1          O(n)       O(n)     O(n)
Hashing             atom    -         O(1)       O(1)       O(n)     O(n)
Tree                atom    -         O(log n)   O(log n)   O(n)     O(n log n)
Heap                id      -         O(1)       O(1)       -        O(n)

Table 1.1: Basic data structures, their features and complexity.

Binary search works by iteratively halving the array, choosing the half which would contain the key searched for. This approach achieves O(log n) search time on uniform spaces. Even so, accesses might be slow when searching large volumes of data, because of the numerous comparisons to be made.

If the ordering of the keyed data is not important, one can employ a hash structure. Hashing structures are commonly based on an array where, in each position of the array, called a slot, one item of data can be stored (closed hashing) [Knu]. The position of a record in the array is determined by a hash function which calculates a natural number from the key. The number is modified in such a way that it fits into the interval range of the array, effectively reducing the problem to radix retrieval.

When two items hash to the same location, a conflict resolution method is applied; it can involve rehashing using another hash function, or just stepping through the array systematically looking for an empty slot.
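A minimal sketch of closed hashing with the simplest conflict resolution, stepping through the array (linear probing). For brevity it assumes the table never fills up; all names are illustrative:

```python
def make_table(capacity):
    return [None] * capacity          # one item per slot ("closed" hashing)

def insert(table, key, value):
    i = hash(key) % len(table)        # reduce the key to an array index
    while table[i] is not None and table[i][0] != key:
        i = (i + 1) % len(table)      # conflict: step to the next slot
    table[i] = (key, value)

def lookup(table, key):
    i = hash(key) % len(table)
    while table[i] is not None:       # probe until an empty slot is hit
        if table[i][0] == key:
            return table[i][1]
        i = (i + 1) % len(table)
    return None                       # key is not in the table

t = make_table(16)
insert(t, "alice", 42)
assert lookup(t, "alice") == 42
```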

1.4.4 Memory Management/Heaps

Data items of varying sizes are typically stored and managed by employing a heap structure. Examples include varying-length strings, pictures, sound, and memo-fields in a DBMS. The heap structure manages an area of memory, which may be preallocated to the application. It keeps track of those areas of memory that are in use and those that are not. An area of memory is allocated to store the data; it will occupy somewhat more memory than the data itself (storage overhead). When the data is not needed anymore, it can be given back to the heap by deallocating/freeing it, for later reuse.

Heaps for main-memory management experience problems with heap fragmentation, where allocated memory blocks of varying sizes are not stored in adjacent memory locations, i.e., a situation may arise where a request for memory cannot be granted because there is no single continuous block of memory available. This problem is traditionally attacked by garbage collection methods, which compact the used memory, removing the "holes" of unused memory. Other methods employ more elaborate allocation schemas.

1.4.5 Linked Lists

The drawback of array storage is that its size often is predefined at compile time. When more data is stored, the array overflows, and programs may malfunction or even crash with disastrous effects. Therefore, dynamic allocation of memory and data storage is essential to match the current needs of the application. A linked list can be viewed as a number of items linked together into a chain by storing additional information, pointers, which for each link point to the "next" item. Searching for a key in a long list is "slow", since, in the basic configuration, it requires scanning the list from the beginning until the matching data is found. However, a list has the advantage that items can easily be added or removed without having to move other items; the main advantage of a linked list implementation is that inserts do not need to move any data.

1.4.6 Chained/Closed Hashing

Linked lists are often used in combination with hashing, to allow every radix position (slot in the array) to store several data items, linked together into a linked list. Every such slot is then said to point to a bucket. Another possibility for implementing a bucket is to associate a memory segment/disk page with a slot in the hashing array. Hashing is analyzed in [Knu].

Still, if the volume of the inserted data is very large, much larger than foreseen for the hashing array, the problem again deteriorates to searching a linked list, since too many items are stored per slot. Normally, the average number of items stored per slot is limited. When this limit is exceeded, a new larger array is created and all the items are moved (inserted) into the new hash structure.

1.4.7 Trees

Tree structures allow efficient inserts, deletes, and retrievals. A tree contains two types of nodes: branch-nodes, which are used for organizing the data, and leaf-nodes, which store the data. A branch-node is a choice-point, where a choice is made between a number of branches. In the simplest case, where each node has two branches, a node can be characterized by a value. Data with a lower value is stored in the sub-tree found by following the left-hand branch; higher values are found through the right-hand branch in the same way. The root-node is the "first" node of the tree. By navigating, starting at the root node, and traversing branch-nodes, a leaf-node is eventually reached.

The, on average, best performing tree under uniform access distributions is a balanced tree, in which all leaf-nodes are at the same distance from the root-node. Such a tree allows inserts and lookups in O(log n) time. In a balanced tree, all the branches (including their reachable subtrees) of a node have the same weight. The "weight" is the number of items that are reachable through the branch. The worst-case scenario for a non-balancing tree is where one of the branches recursively has the largest weight. In such a skewed tree, for example a list-like (degenerate) tree, each branch-node would have one leaf-node and another branch-node. The distance to the root for the last leaf-node is then linear in the number of items, giving the search time O(n). However, with randomly ordered input data, it is highly unlikely that the tree deteriorates this much, and on average the navigation is said to be O(log n). The sketch below illustrates this sensitivity to insertion order.

One reason that non-balancing trees have such a bad worst-case performance is that they do not dynamically adjust the subtrees of the nodes. Instead, a node's split criterion was fixed when the node was created.

The AVL-tree is a binary balancing tree. It dynamically adjusts the branch nodes, so that all leaves are kept at roughly the same distance from the root. B-trees are another example of a balancing tree, but they allow tuning of the number of branches per node. B-trees are popular for disk-based storage and DBMSs.

Quad-trees [Sam89] allow for spatial data, i.e., data points with a coordinate (x, y). There are several variants of quad-trees: some allow a dynamic choice of split-value for a branch-node, and in others this value is predetermined. The dynamic choice can lead to highly skewed trees, depending on insertion order, whereas the predetermined variant may create unnecessarily deep sub-trees for highly clustered data.

Trees that are invariant of the insert order, i.e., independent of the insert order of the items, always yield a tree with the same structure. One example is a quad-tree using bits for its organization [Sam89]. In such a tree the splitting criterion is predetermined; each node's splitting criterion depends solely on its position in the tree (height and nodes above). These trees can experience problems with clustered data yielding deep branches; these can, however, be efficiently compacted [Sam89].
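The following sketch (illustrative Python, not from the thesis) makes the order sensitivity concrete: inserting the same 256 keys in sorted versus random order yields a degenerate versus a reasonably shallow tree.

```python
import random

class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    """Plain (non-balancing) BST insert: each node's split value is
    fixed at creation time and never adjusted afterwards."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def height(root):
    return 0 if root is None else 1 + max(height(root.left), height(root.right))

keys = list(range(256))
sorted_root = None
for k in keys:                    # sorted input: the tree degenerates to a list
    sorted_root = insert(sorted_root, k)

random.shuffle(keys)
random_root = None
for k in keys:                    # random input: height stays near O(log n)
    random_root = insert(random_root, k)

print(height(sorted_root), height(random_root))   # 256 vs. typically under 20
```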

1.4.8 Signature Files

A signature file [FBY92] works as an inexact filter. It is mainly used in Information Retrieval to index words in documents, but can be applied successfully to other data items too, such as time-series data [Jön99]. For each document a number of signatures are stored in a file; each signature is viewed as an array of bits of fixed size. Typically, each relevant word in the document is hashed to a bit in the signature, setting this bit to 1, but other coding schemes can also be used. A signature is often chosen so that approximately half of the bits are set. To search for a word, all signatures are scanned. For signatures that match, it is probable that the corresponding document contains the word searched for. The document can then be retrieved and tested. Some signatures match even though the document does not contain the word searched for; these matches are called "false hits". A signature typically consists of hash-coded bit patterns. Scanning signature files is much faster than scanning the actual documents; still, it is only a few orders of magnitude faster. The space overhead is typically chosen to be 10% to 15%.
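A minimal sketch of the word-to-bit scheme in Python (parameters are illustrative; a real signature file tunes the signature width so that about half the bits end up set):

```python
import hashlib

SIGNATURE_BITS = 64   # fixed-size bit array per document (illustrative)

def signature(words):
    """Superimpose one hashed bit per word into a fixed-size signature."""
    sig = 0
    for w in words:
        h = int(hashlib.md5(w.lower().encode()).hexdigest(), 16)
        sig |= 1 << (h % SIGNATURE_BITS)
    return sig

def may_contain(sig, word):
    """Inexact filter: False means definitely absent; True may be a false hit."""
    h = int(hashlib.md5(word.lower().encode()).hexdigest(), 16)
    return sig & (1 << (h % SIGNATURE_BITS)) != 0

docs = {"d1": "the city bank opened", "d2": "a quiet river town"}
sigs = {name: signature(text.split()) for name, text in docs.items()}

# Scan the small signature file instead of the documents themselves;
# every candidate must still be checked against the real text (false hits).
candidates = [n for n, s in sigs.items() if may_contain(s, "city")]
```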

Structure/Method      Positive                                Negative
Array                 + compact                               - ordered inserts O(n)
                                                              - fixed size
                                                              - slack = O(N - n)
Dynamic Array         + compact                               - indirection overhead
                      + dynamic
                      + limited slack
Memory Heap           + dynamic sizes of data                 - O(n) overhead
                      + O(1) store/retrieval via "handle"     - deteriorates with usage
                                                              - no search possible
Linked List           + dynamic                               - O(n) overhead
                      + insert in O(1) time                   - non-compact
                                                              - O(n) searches
Open Hashing          + "O(1)" access time                    - fixed size
                                                              - complex collision handling
Closed Hashing        + "O(1)" access time                    - fixed size
                                                              - buckets with slack/linked list/array
Dynamic Hashing (LH)  + "O(1)" search time                    - accumulated O(log n) insert time
                                                              - complex implementation
Tree                  + dynamic                               - skew gives O(n) in worst case
                                                              - storage overhead n + n/2
                                                              - insert order sensitive
Balanced Tree         + dynamic                               - rearranging cost (dyn. array)
                      + guaranteed O(log n) retrieval
Signature Files       + fast "O(1)" insert                    - slow O(n) search time
                      + approximate                           - no order
                                                              - storage overhead n

Table 1.2: Positive and negative properties of basic data structures.

1.5 Roundup

Table 1.2 displays a list of data structures and what I see as their most positive and negative properties. The list is in no way complete, but it is provided as a summary of the discussions in this chapter.

1.6 LH* (1-dimensional data)

We start out by describing LH* [LNS93], the first full SDDS designed. LH* defined the basics for SDDSs and inspired me and many others to boldly create and explore areas where no man has gone before.

We will now describe the LH* SDDS; later on we describe LH*LH. LH* is a data structure that generalizes Linear Hashing to parallel or distributed RAM and disk files [LNS96]. One benefit of LH* over ordinary LH is that it enables autonomous parallel insertion and access. The number of buckets, and the buckets themselves, can grow gracefully. Insertion requires one message in general and three in the worst case. Retrieval requires at least two messages, possibly three or four. In experiments it has been shown that insertion performance is very close to one message (+3%) and that retrieval performance is very close to two messages (+1%). The main advantage is that no central directory is required for managing the global parameters.

1.6.1 LH* Addressing Scheme

An LH*-client is a process that accesses an LH* file on behalf of the application. An LH*-server at a node stores data of LH* files. An application can use several clients to explore a file. This way of processing increases the throughput, as will be shown in Section 2.6. Both clients and servers can be created dynamically.

[Figure: a data client, holding the image "level = 1, pointer = 0", sends an insert to the data servers; an addressing error is forwarded between servers according to the hash functions, and the correct server returns an IAM to the client.]

Figure 1.1: LH* File Expansion Scheme.

At a server, one bucket per LH* file contains the stored data. The bucket management is described in Section 2.2. The file starts at one server and expands to others when it overloads the buckets already being used.

The global addressing rule in an LH* file is that every key C is inserted at the server s_C, whose address s = 0, 1, ..., N - 1 is given by the following LH addressing algorithm [Lit94]:

s_C := h_i(C);
if s_C < n then s_C := h_{i+1}(C),

where i (the LH* file level) and n (the split pointer address) are file parameters evolving with splits. The h_i functions are basically:

h_i(C) = C mod (2^i x K), K = 1, 2, ...,

and K = 1 in what follows. No client of an LH* file knows the current i and n of the file. Every client has its own image of these values, let them be i' and n'; typically i' <= i [LNS93]. The client sends the query, for example the insert of key C, to the address s'_C(i', n').

The server s'_C verifies upon query reception whether its own address equals s_C, using a short algorithm stated in [LNS93]. If so, the server processes the query. Otherwise, it calculates a forwarding address s''_C using the forwarding algorithm in [LNS93] and sends the query to server s''_C. Server s''_C acts as s'_C did and perhaps resends the query to server s'''_C, as shown for Server 1 in Figure 1.1. It is proven in [LNS93] that s'''_C must then be the correct server. In every case of forwarding, the correct server sends to the client an Image Adjustment Message (IAM) containing the level i of the correct server. Knowing this i and the s_C address, the client adjusts its i' and n' (see [LNS93]) and from then on will send C directly to s_C.
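The client-side address computation and the server-side check can be sketched as follows (a Python simplification of the algorithms in [LNS93], with K = 1; the function names are ours):

```python
def h(i, C):
    """The LH hash functions: h_i(C) = C mod 2^i (with K = 1)."""
    return C % (2 ** i)

def client_address(C, i_img, n_img):
    """A client computes the target server from its possibly outdated
    image (i', n') of the file level and split pointer."""
    a = h(i_img, C)
    if a < n_img:                 # this bucket has already split at level i'
        a = h(i_img + 1, C)
    return a

def server_check(C, a, j):
    """Server a, whose bucket carries level j, either accepts the key
    (returning None) or returns a forwarding address; a sketch of the
    forwarding algorithm in [LNS93]."""
    a_next = h(j, C)
    if a_next == a:
        return None               # correct server; if forwarding occurred,
                                  # an IAM with level j goes back to the client
    a_guess = h(j - 1, C)
    if a < a_guess < a_next:
        a_next = a_guess          # never forward beyond the correct server
    return a_next
```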

1.6.2 LH* File Expansion

An LH* file expands through bucket splits, as shown in Figure 1.1. The next bucket to split is generally noted bucket n; n = 0 in the figure. Each bucket keeps the value of i it uses (called the LH*-bucket level) in its header, starting from i = 0 for bucket 0 when the file is created. Bucket n splits through the replacement of h_i with h_{i+1} for every C it contains. As a result, typically half of its records move to a new bucket N, appended to the file with address n + 2^i. In Figure 1.1, N = 8. After the split, n is set to (n + 1) mod 2^i. The successive values of n can thus be seen as a linear move of a split token through the addresses 0, 0, 1, 0, 1, 2, 3, 0, ..., 2^i - 1, 0, ... The arrows in Figure 1.1 show both the token moves and the new bucket address for every split, as resulting from this scheme.
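A small simulation (illustrative Python) reproduces the token walk and the appended bucket addresses described above:

```python
i, n = 0, 0                          # file level and split pointer
token_path, new_buckets = [], []
for _ in range(15):                  # fifteen successive splits
    token_path.append(n)             # bucket n splits next...
    new_buckets.append(n + 2 ** i)   # ...its movers go to bucket n + 2^i
    n += 1
    if n == 2 ** i:                  # every bucket of this level has split
        n, i = 0, i + 1

print(token_path)    # [0, 0, 1, 0, 1, 2, 3, 0, 1, 2, 3, 4, 5, 6, 7]
print(new_buckets)   # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
```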

Splitting Control Strategies

There are many strategies, called split control strategies, that one can use to decide when a bucket should split [LNS96] [Lit94] [WBW94]. The overall goal is to avoid overloading the file. As no LH* bucket can know the global load, one way to proceed is to fix some threshold S on a bucket [LNS96]. Bucket n splits when it gets an insert and the actual number of objects it stores is at least S. S can be fixed as a file parameter, but a potentially better-performing strategy is to calculate S for bucket n dynamically using the following formula:

S = M x V x (2^i + n) / 2^i,

where i is the n-th LH*-bucket level, M is a file parameter, and V is the bucket capacity in number of objects. Typically one sets M to some value between 0.7 and 0.9.

The intuition behind the formula is as follows. A split to a new server should occur for every M x V global inserts into the data structure, thus aiming at keeping the mean load of the buckets constant:

global number of inserts / number of servers = constant.

A server without any knowledge about the other servers can only use its own information, that is, its bucket number n and the level i, to estimate the global load. It knows that every server < n, i.e., servers 0..n - 1, has split into servers 2^i..2^i + n - 1, and both these groups thus have half the load of the servers that are not yet split, servers n..2^i - 1. The number of servers can be calculated as 2^i + n, which gives us an estimated global load of

M x V x (2^i + n).

Servers that were split, and new servers, have half the load, S/2, of those that are still to split, which have the load S. The n new servers come from n servers, totalling 2 x n servers with the load S/2, and 2^i + n - 2 x n = 2^i - n remaining servers to be split later with a load of S. The total over these servers can then be expressed as

(1/2) x S x 2 x n + S x (2^i - n).

This can be simplified to S x 2^i. Setting the global estimate equal to the last expression provides, after some simplification,

M x V x (2^i + n) = S x 2^i.

Solving for S gives the formula for S expressed above.

The performance analysis in Section 2.6.1 shows indeed that the dynamic strategy is to be preferred in our context. This is the strategy adopted for LH*LH.
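The resulting split decision is a one-liner per insert; a Python sketch with illustrative parameter values (V = 100 objects per bucket, M = 0.8):

```python
def split_threshold(M, V, i, n):
    """Dynamic split threshold S = M * V * (2^i + n) / 2^i for bucket n
    at level i; V is the bucket capacity, M typically 0.7..0.9."""
    return M * V * (2 ** i + n) / 2 ** i

def should_split(bucket_size, M, V, i, n):
    # Bucket n splits when an insert arrives and it already holds >= S objects.
    return bucket_size >= split_threshold(M, V, i, n)

# Example: at level i = 3 with split pointer n = 2,
# S = 0.8 * 100 * (8 + 2) / 8 = 100 objects.
assert abs(split_threshold(0.8, 100, 3, 2) - 100.0) < 1e-9
```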

1.6.3 Conclusion

LH is well known for its scalability in handling a dynamically growing dataset, and the new, distributed LH* has also been proven scalable. Both of these hashing algorithms use the actual bit representation of the hash values; these are given by the keys. Hashing in general can be seen as a radix sort over an interval where each value has a bucket in which it stores the items. LH can in turn be viewed as a radix sort using the lower bits of the hash value of the keys. It furthermore has an extra attribute that tells us the number of bits used, and a splitting pointer. The splitting pointer allows gradual growth and shrinkage of the range of values (number of buckets) used for the radix sort.

LH* is a variant of LH that enables simultaneous access from several clients to data stored on several server nodes. One LH bucket corresponds to the data stored on a server node. In spite of not having a central directory, the LH* algorithm allows for extremely fast updates of a client's view, so that it will access the right server nodes when inserting and retrieving data. LH* [LNS93] was one of the first Scalable Distributed Data Structures (SDDSs). It generalizes LH [Lit80] to files distributed over any number of sites. One benefit of LH* over LH is that it enables autonomous parallel insertion and access. Whereas the number of buckets in LH changes gracefully, LH* lets the number of distribution sites change as gracefully. Any number of clients can be used; the access network is the only limitation to linear scale-up of the capacity with the number of servers, for hashed access. In general, insertion requires one message, and in the worst case three messages. Retrieval requires one more message. But the main point is that no central directory is needed for access to the data.

1.7 Orthogonal Aspects

In this section we list important properties of the data structures studied in this thesis. These properties should ultimately be independently available for data storage; in practice this is not the case. For example, distribution or parallelism gives better performance but generally decreases the availability. More dimensions give more overhead and/or worse performance.

1.7.1 Performance

It is desired that single lookup/insert operations can be performed in "constant time"; in practice, however, O(log n) usually suffices. When one or more parameters are varied for a data structure, such as dimensions, distribution, availability, or communication topologies, they will inevitably affect performance. Disk I/O, as well as cache-misses in RAM, should be avoided. In some cases the actual CPU cycles may be of importance in common operations, such as scanning arrays of data.

1.7.2 Dimensions

For classic data structures, only one-dimensional data is allowed. That is, one key is used for retrieval and inserts.

Two dimensions are also fairly well covered by the literature. Many structures combine the x and y values into one value, and use this value for indexing in a classic one-dimensional data structure. Some structures are based on order-preserving hashing, interleaving the x and y binary representations to form another value, later used either for indexing a one-dimensional hash-file (referred to as multi-dimensional hashing) or for building quad-tree style structures; a sketch of this bit-interleaving follows below. Common operations on spatial (2-dimensional) data structures involve point lookup, region retrieval, closest neighbors, or similarity retrieval [Sam89].

It is a well-known fact that most multi-dimensional data structures suffer from the multi-dimensional curse [WSB98]: the performance degrades by many orders of magnitude when the dimensionality increases. For similarity retrieval it has been observed [WSB98] that it is better to perform scanning over the whole dataset, or to use a compact signature file, than to try to use a multi-dimensional data structure. It was shown that their scanning method already at 13 dimensions outperformed known efficient multi-dimensional data structures.
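A sketch of the bit-interleaving, often called a Morton or z-order key (illustrative Python, not from the thesis):

```python
def interleave(x: int, y: int, bits: int = 16) -> int:
    """Interleave the binary representations of x and y (Morton / z-order),
    producing one key for a one-dimensional structure."""
    z = 0
    for b in range(bits):
        z |= ((x >> b) & 1) << (2 * b)       # x supplies the even bit positions
        z |= ((y >> b) & 1) << (2 * b + 1)   # y supplies the odd bit positions
    return z

# Nearby points tend to get nearby keys, so the one-dimensional index
# preserves some two-dimensional locality:
assert interleave(0b11, 0b00) == 0b0101
assert interleave(0b00, 0b11) == 0b1010
```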

1.7.3 Overhead

Data structures allow for efficient indexing, in order to accelerate retrieval. However, the storage overhead is not negligible in most indices: B-trees may store duplicate keys in the internal nodes, and hashing structures often have some slack to avoid worst cases. The storage overhead (unused space) is typically around 50% for a B-tree and 10-15% for a hash structure. Performance is affected even more by the CPU usage for navigating the index, or by the cost of calculating hash-keys. The compactness and locality of the memory navigation is another important concern on CPUs with large internal caches [BMK99].

1.7.4 Distribution and Parallelism

To handle very large amounts of data, distribution or parallelism is traditionally employed. Distribution of data, however, adds additional storage overhead, often more calculations, as well as communication messaging. Parallelism using shared memory is limited by hardware-architecture scale-up limits, but it avoids costly messaging by using other means of synchronization. The more data stored at more sites, the more messages and the more overhead in accessing and organizing the data, as well as in processing it.

B-trees and hash structures are preferred for creating distributed indices. They can be used to administer and automatically decluster the data set over a number of nodes. However, they are often static in their structure, allowing only limited load balancing, and they perform poorly when the presumptions change.

SDDSs are, in a way, the "balancing" distributed data structures. They generally allow for retrievals in "near constant" time, O(1)..O(log n), or rather in a "near constant" number of messages on average. The performance of SDDSs is often assessed by simulations that count the number of serial messages needed for data to be found or inserted. For example, LH* [LNS93] has been reported to allow retrieval of data distributed over hundreds of nodes in less than 2.001 messages on average [LNS93]. Furthermore, LH* limits the number of messages needed for retrieval to 4. Other SDDSs do not guarantee an upper bound, but instead offer acceptable average performance.

1.7.5 Availability

Availability means that data can be made available when it is needed. It may involve reconstruction of the actual data by applying logs, or by combining partial replicas.

In real-time systems, data structures are designed to give guaranteed performance, bounded both in time and in availability. Many "dynamic" structures are more vague, quoting average performance values. Disk-based data may be cached in main memory or require random disk accesses, and may be delayed because of other users' accesses to the same disk.

In the event of distributed systems, the task is even more difficult. Storage nodes may be unavailable at times, because of hardware or software faults, or network congestion. Some specialized networks are designed to be able to give promises about performance (ATM). Common solutions for achieving high availability use techniques such as RAID storage [PGK88], replication, logging and hot standby, and failure recovery [Tor95].