Chapter 1: Preliminaries

We start out by giving an introduction to SDDSs, followed by a description of LH*, which generated this track of data storage structures. We will motivate their existence and point out some possible application areas; subsequently we describe common (basic) data structures (known to most readers), their behavior and properties from the perspective of SDDSs. This defines the terminology and allows us later to describe design choices more accurately.

1.1 Birthground of SDDSs

In traditional distributed file systems, in implementations like NFS or AFS, a file resides entirely at one specific site. This presents obvious limitations, not only on the size of the file, but also on the scalability of access performance. To overcome these limitations, distribution over multiple sites has been used. One example of such a scheme is round-robin [Cor88], where records of a file are evenly distributed by rotating through the nodes as records are inserted. The hash-declustering method of [KTM084] assigns records to nodes on the basis of a hashing function. The range-partitioning method of [DGG+86] divides key values into ranges and assigns different ranges to different nodes. A common aspect of these schemes is their static behavior, which means that the declustering criterion does not change over time. Hence, updating a directory or declustering function is not required. The price to pay is that the file cannot expand over more sites than initially allocated.

To overcome this limitation of static schemes, dynamic partitioning is used. The first such scheme was DLH [SPW90]. This scheme was designed for a shared-memory system. In DLH, the file is in RAM and the file parameters are cached in the local memory of each processor. The caches are refreshed selectively when addressing errors occur, and through atomic updates to all the local memories at certain points. DLH appears impressively efficient for high insertion rates.


1.2 SDDSs

SDDSs were proposed for distributing files in a network multi-computer environment, hence without shared memory. The first scheme was LH* [LNS93]. Distributed Dynamic Hashing (DDH) [Dev93] is another SDDS, based on Dynamic Hashing [Lar78]. The idea, with respect to LH*, is that DDH allows greater splitting autonomy by immediately splitting overflowing buckets. One drawback is that while LH* limits the number of forwardings to two (in theory, communication delays could trigger more forwarding [WBW94]) when the client makes an addressing error, DDH may use O(log2 N) forwardings, where N is the number of buckets in the DDH file.

[WBW94] extends LH* and DDH to control the load of a file more efficiently. The main idea is to manage several buckets of a file per server, while LH* and DDH have basically only one bucket per server. One also controls the server load, as opposed to the bucket load as in LH*.

Both [KW94] and [LNS94] propose primary-key ordered files. In [KW94] the access computations on the clients and servers use a distributed binary search, whereas the SDDSs in [LNS94], collectively termed RP*, use broadcast or distributed n-ary trees. It is shown that both kinds of SDDSs allow for much larger and faster files than the traditional ones.

1.3 Requirements from SDDSs

SDDSs (Scalable Distributed Data Structures), such as the distributed variant of Linear Hashing, LH* [LNS96], and others [Dev93][WBW94][LNS94], open up new areas of storage capacity and data access. There are three requirements for an SDDS:

First, it should have no central directory to avoid hot-spots.

Second, each client should have some approximate image of how data is distributed. This image should be improved each time a client makes an addressing error.

Third, if the client has an outdated image, it is the responsibility of the SDDS to forward the data to the correct data server and to adapt the client's image.

SDDSs are good for distributed computing since they aim at minimizing communication, which in turn minimizes response time and enables more efficient use of processor time.

In light of LH [Lit80] and LH* [LNS96] the following terms are used. The data sites, termed servers, can be used from any number of autonomous sites, termed clients. To avoid a hot-spot, there is no central directory for the addressing across the current structure of the file. Each client has its own image of this structure. An image can become outdated when the file expands. The client may then send a request to an incorrect server. The servers forward such requests, possibly in several steps, towards the correct address. The correct server appends to the reply a special message to the client, called an Image Adjustment Message (IAM). The client adjusts its image, avoiding a repetition of the error. A well-designed SDDS should make addressing errors occasional and forwards few, and it should provide for scalability of the access performance as the file grows. A typical SDDS scenario has many more clients than servers, with reasonably active clients, i.e., a hundred or more interactions in the lifetime of a client.

1.4 Data Structures — Background

In this section we give a short overview of commonly used data structures for indexing. We start out by introducing desirable features of such data structures. In a distributed scenario using SDDSs these desired properties are of even higher importance. The workings of distributed data structures are not the main topic of this section, so it can be skipped by the expert on data structures.

Data in DBMSs is organized using data structures, also referred to as access structures, access paths, accelerators, indices, or indexing structures. We identify three important properties for a "good" data structure. The first is that application accesses to the individual elements encoded in the data structure should be fast, i.e., insertion or retrieval should be efficient (access overhead). Secondly, the storage overhead, i.e., the extra storage space needed for organizing the data and improving the access speed, should be low. Third, the data structure should be able to handle the amount of data that is needed, i.e., it should be scalable: the structure should dynamically adapt to different storage sizes without deteriorating performance.

1.4.1 Retrieval Methods

Data is structured in records containing fields (attributes), e.g., bank account information. Some fields, called search keys, are the target of retrieval. The search method depends on the inherent search characteristics, which can be classified along the following lines:

Key retrieval (lookup)

Range retrieval (domain specific)

Approximate retrieval (sub-string, soundex)

Predicate search (filtering)

Multi-dimensional searches (point, spatial, space, nearest)

Information retrieval (probabilistic)

Where some applications access data using one key, others may need to retrieve data in a certain range, which leads to range retrieval. Approximate search is another type of retrieval, often user specified, allowing for matching under a similarity measure. A special case of proximity is sub-string search, for example searching for addresses whose names contain the string "city". Soundex search allows one to search for names of people that sound like "John" (Jon, John, Jonny, Jonni, Johnnie, Johnny). Soundex searching is efficiently implemented by mapping the search key to a normalized (sound-invariant) spelling representation.

Other kinds of retrieval might consider several fields simultaneously, often referred to as different attributes or dimensions. Examples of dimensions are spatial (x, y)-coordinates, or, in a common database, Zip-code, Age, and Income. A multi-dimensional indexing structure allows retrieval of data using several of these keys (dimensions) at the same time. A more general case is predicate search, which allows the program to specify an arbitrary predicate which, when invoked on the data, returns a true/false value. If the predicate yields true, the data is returned to the user/application.

Information retrieval sciences are not that strict, and employ a scoring function which scores the data, returning the "best matches" ranked with the best match first. Web search engines, such as http://www.altavista.com/, employ various searching and scoring methods.
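To make the sound-invariant mapping concrete, here is a minimal Python sketch of a classic Soundex-style encoding (the exact rules vary between implementations; this variant and its details are illustrative, not taken from the thesis):

```python
def soundex(name: str) -> str:
    """Map a name to a sound-invariant code (classic Soundex sketch)."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    if not name:
        return ""
    # Keep the first letter, then encode the rest, skipping vowels
    # and collapsing runs of the same digit.
    result = name[0].upper()
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        if ch not in "hw":        # h/w do not break a run in classic Soundex
            prev = digit
    return (result + "000")[:4]   # pad or truncate to four characters

# Spellings that differ only in vowels or doubled consonants collapse
# to the same code, so one lookup finds all of them:
assert soundex("John") == soundex("Jon") == soundex("Johnny")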

1.4.2 Reasonable Properties

A reasonably well-behaved and efficient data structure can be expected to fulfill most of the following statements.

A data structure is a container that stores n items of data.

Each item is identified by its key(s) (algorithms typically assume that the keys are unique, but in practice this is often relaxed).

A single insert of an item should ideally be done in constant time, but typically O(log n) time is acceptable.

Lookups using unique keys are expected to be faster than, or at least exhibit the same cost as, inserts.

Iteration over all items in the data structure takes O(n).

If the data structure supports ordered retrieval, i.e., previous and next operations, these are expected to take O(1) time. From this follows that, starting at the first item and applying the next operation on succeeding items until the last is reached, should take O(n). Localization is then O(n log n) at worst.

Complex querying ("searching") of the data structure, returning r items, optimally takes O(r) time. The worst case, however, is O(n). An effective structure might perform pruning and achieve O(r + log n). The effectiveness of a search can be expressed by the overhead t/r, where t is the number of tests, together with the pruning factor (n - t)/n.

Most data structures are in practice limited by available memory, disk page sizes, etc. If the data structure can dynamically restructure itself and keep up its performance, then we say it is dynamic. Then, theoretically, it has no upper limit on the amount of items it can handle.

However, many dynamic structures deteriorate on skewed data, creating a structure where most of the data is stored in few places, and the insert and retrieval operations deteriorate in performance. The cause may be that the partitioning function used does not perform well or that the input data appears in an inconvenient order. Some structures balance themselves to avoid this deterioration problem.

A non-dynamic structure can be replaced by another non-dynamic structure which can hold more data, avoiding the deterioration/skew experienced. This was the classic way to achieve dynamic structures (rebuilding; typical variants are array-doubling implementations and the rehashing of hash-structures; see the sketch after this list). One problem with this approach is that in some cases it may require the same time for re-inserting all the data stored in the previous structure, as well as double the amount of space for that time. We will refer to structures that gracefully accommodate these problems as being scalable. Examples include Linear Hashing [Lit80], Dynamic Hashing [Lar78], and the B-tree [BM72].
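For concreteness, a minimal Python sketch of the rebuilding approach (the array-doubling variant; names are illustrative). The O(n) re-insertion and the temporarily doubled space discussed above are both visible:

```python
class DynamicArray:
    """Rebuild-style growth: when the fixed array overflows, allocate a
    larger one and copy everything over (briefly using double the space)."""
    def __init__(self):
        self.items = [None] * 4
        self.size = 0

    def append(self, x):
        if self.size == len(self.items):          # overflow: rebuild
            bigger = [None] * (2 * len(self.items))
            for i in range(self.size):            # O(n) copy of all items
                bigger[i] = self.items[i]
            self.items = bigger                   # old array now discarded
        self.items[self.size] = x
        self.size += 1
```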

1.4.3 Basic Data Organization

The simplest and most common storage structure is the array, in most programming languages predetermined in size and in the type of data it stores. The elements of an array are generally stored contiguously in memory and accessed using an implicitly calculated index. Retrieving data is easy when the position is known or when it can be calculated easily. When data can be retrieved by direct look-up in an array structure we call it radix retrieval or direct addressing. When the data is stored in a sorted array, we can use binary search to locate the exact position of the data requested.

Structure/Method    Key     Ordered   Insert     Lookup     Search   Memory
Array               1..n    -         1          1          O(n)     O(n)
Array               1..n    Y         O(n)       n/2        O(n)     O(n)
List                atom    -         1          O(n)       O(n)     O(n)
Hashing             atom    -         O(1)       O(1)       O(n)     O(n)
Tree                atom    -         O(log n)   O(log n)   O(n)     O(n log n)
Heap                id      -         O(1)       O(1)       -        O(n)

Table 1.1: Basic data structures, their features and complexity.

Binary search works by iteratively halving the array, choosing the half which would contain the key searched for. This approach achieves O(log n) search time on uniform spaces. Even so, accesses might be slow when searching large volumes of data, because of the numerous comparisons to be made.

If the ordering of the keyed data is not important, one can employ a hash structure. Hashing structures are commonly based on an array where, in each position of the array, called a slot, one item of data can be stored (closed hashing) [Knu]. The position of a record in the array is determined by a hash function which calculates a natural number from the key. The number is modified in such a way that it fits into the interval range of the array, effectively reducing the problem to radix retrieval.

When two items hash to the same location, a conflict resolution method is applied; it can involve rehashing using another hash function, or just stepping through the array systematically looking for an empty slot.
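A minimal sketch of closed hashing with the simplest conflict resolution, stepping through the array (linear probing). For brevity it assumes the table never fills up; all names are illustrative:

```python
def make_table(capacity):
    return [None] * capacity          # one item per slot ("closed" hashing)

def insert(table, key, value):
    i = hash(key) % len(table)        # reduce the key to an array index
    while table[i] is not None and table[i][0] != key:
        i = (i + 1) % len(table)      # conflict: step to the next slot
    table[i] = (key, value)

def lookup(table, key):
    i = hash(key) % len(table)
    while table[i] is not None:       # probe until an empty slot is hit
        if table[i][0] == key:
            return table[i][1]
        i = (i + 1) % len(table)
    return None                       # key is not in the table

t = make_table(16)
insert(t, "alice", 42)
assert lookup(t, "alice") == 42
```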

1.4.4 Memory Management/Heaps

Data items of varying sizes are typically stored and managed by employing a heap structure. Examples include varying-length strings, pictures, sound, and memo-fields in a DBMS. The heap structure manages an area of memory, which may be preallocated to the application. It keeps track of those areas of memory that are in use and those that are not. An area of memory is allocated to store the data; it will occupy somewhat more memory than the data itself (storage overhead). When the data is not needed anymore, it can be given back to the heap by deallocating/freeing it, for later reuse.

Heaps for main-memory management experience problems with heap fragmentation, where allocated memory blocks of varying sizes are not stored in adjacent memory locations, i.e., a situation may arise where a request for memory cannot be granted because there is no single continuous block of memory available. This problem is traditionally attacked by garbage collection methods, which compact the used memory, removing the "holes" of unused memory. Other methods employ more elaborate allocation schemas.

1.4.5 Linked Lists

The drawback of array storage is that its size often is predefined at compile time. When more data is stored, the array overflows, and programs may malfunction or even crash with disastrous effects. Therefore, dynamic allocation of memory and data storage is essential to match the current needs of the application. A linked list can be viewed as a number of items linked together into a chain by storing additional information, pointers, which for each link point to the "next" item. Searching for a key in a long list is "slow", since, in the basic configuration, it requires scanning the list from the beginning until the matching data is found. However, a list has the advantage that items can easily be added or removed without having to move other items; the main advantage of a linked list implementation is that inserts do not need to move any data.

1.4.6 Chained/Closed Hashing

Linked lists are often used in combination with hashing, to allow every radix position (slot in the array) to store several data items, linked together into a linked list. Every such slot is then said to point to a bucket. Another possibility for implementing a bucket is to associate a memory segment/disk page with a slot in the hashing array. Hashing is analyzed in [Knu].

Still, if the volume of the inserted data is very large, much larger than foreseen for the hashing array, the problem again deteriorates to searching a linked list, since too many items are stored per slot. Normally, the average number of items stored per slot is limited. When this limit is exceeded, a new larger array is created and all the items are moved (inserted) into the new hash structure.

1.4.7 Trees

Tree structures allow efficient inserts, deletes, and retrievals. A tree contains two types of nodes: branch-nodes, which are used for organizing the data, and leaf-nodes, which store the data. A branch-node is a choice-point, where a choice is made between a number of branches. In the simplest case, where each node has two branches, a node can be characterized by a value. Data with a lower value is stored in the sub-tree found by following the left-hand branch; higher values are found through the right-hand branch in the same way. The root-node is the "first" node of the tree. By navigating, starting at the root node, and traversing branch-nodes, a leaf-node is eventually reached.

The, on average, best performing tree under uniform access distributions is a balanced tree, in which all leaf-nodes are at the same distance from the root-node. Such a tree allows inserts and lookups in O(log n) time. In a balanced tree, all the branches (including their reachable subtrees) of a node have the same weight. The "weight" is the number of items that are reachable through the branch. The worst-case scenario for a non-balancing tree is where one of the branches recursively has the largest weight. In such a skewed tree, for example a list-like (degenerate) tree, each branch-node would have one leaf-node and another branch-node. The distance to the root for the last leaf-node is then linear in the number of items, giving the search time O(n). However, with randomly ordered input data, it is highly unlikely that the tree deteriorates this much, and on average the navigation is said to be O(log n). The sketch below illustrates this sensitivity to insertion order.

One reason that non-balancing trees have such a bad worst-case performance is that they do not dynamically adjust the subtrees of the nodes. Instead, a node's split criterion was fixed when the node was created.

The AVL-tree is a binary balancing tree. It dynamically adjusts the branch nodes, so that all leaves are kept at roughly the same distance from the root. B-trees are another example of a balancing tree, but they allow tuning of the number of branches per node. B-trees are popular for disk-based storage and DBMSs.

Quad-trees [Sam89] allow for spatial data, i.e., data points with a coordinate (x, y). There are several variants of quad-trees: some allow a dynamic choice of split-value for a branch-node, and in others this value is predetermined. The dynamic choice can lead to highly skewed trees, depending on insertion order, whereas the predetermined variant may create unnecessarily deep sub-trees for highly clustered data.

Trees that are invariant of the insert order, i.e., independent of the insert order of the items, always yield a tree with the same structure. One example is a quad-tree using bits for its organization [Sam89]. In such a tree the splitting criterion is predetermined; each node's splitting criterion depends solely on its position in the tree (height and nodes above). These trees can experience problems with clustered data yielding deep branches; these can, however, be efficiently compacted [Sam89].
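The following sketch (illustrative Python, not from the thesis) makes the order sensitivity concrete: inserting the same 256 keys in sorted versus random order yields a degenerate versus a reasonably shallow tree.

```python
import random

class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    """Plain (non-balancing) BST insert: each node's split value is
    fixed at creation time and never adjusted afterwards."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def height(root):
    return 0 if root is None else 1 + max(height(root.left), height(root.right))

keys = list(range(256))
sorted_root = None
for k in keys:                    # sorted input: the tree degenerates to a list
    sorted_root = insert(sorted_root, k)

random.shuffle(keys)
random_root = None
for k in keys:                    # random input: height stays near O(log n)
    random_root = insert(random_root, k)

print(height(sorted_root), height(random_root))   # 256 vs. typically under 20
```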

1.4.8 Signature Files

A signature file [FBY92] works as an inexact filter. It is mainly used in Information Retrieval to index words in documents, but can be applied successfully to other data items too, such as time-series data [Jön99]. For each document a number of signatures are stored in a file; each signature is viewed as an array of bits of fixed size. Typically, each relevant word in the document is hashed to a bit in the signature, setting this bit to 1, but other coding schemes can also be used. A signature is often chosen so that approximately half of the bits are set. To search for a word, all signatures are scanned. For signatures that match, it is probable that the corresponding document contains the word searched for. The document can then be retrieved and tested. Some signatures match even though the document does not contain the word searched for; these matches are called "false hits". A signature typically consists of hash-coded bit patterns. Scanning signature files is much faster than scanning the actual documents; still, it is only a few orders of magnitude faster. The space overhead is typically chosen to be 10% to 15%.
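A minimal sketch of the word-to-bit scheme in Python (parameters are illustrative; a real signature file tunes the signature width so that about half the bits end up set):

```python
import hashlib

SIGNATURE_BITS = 64   # fixed-size bit array per document (illustrative)

def signature(words):
    """Superimpose one hashed bit per word into a fixed-size signature."""
    sig = 0
    for w in words:
        h = int(hashlib.md5(w.lower().encode()).hexdigest(), 16)
        sig |= 1 << (h % SIGNATURE_BITS)
    return sig

def may_contain(sig, word):
    """Inexact filter: False means definitely absent; True may be a false hit."""
    h = int(hashlib.md5(word.lower().encode()).hexdigest(), 16)
    return sig & (1 << (h % SIGNATURE_BITS)) != 0

docs = {"d1": "the city bank opened", "d2": "a quiet river town"}
sigs = {name: signature(text.split()) for name, text in docs.items()}

# Scan the small signature file instead of the documents themselves;
# every candidate must still be checked against the real text (false hits).
candidates = [n for n, s in sigs.items() if may_contain(s, "city")]
```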

Structure/Method      Positive                                Negative
Array                 + compact                               - ordered inserts O(n)
                                                              - fixed size
                                                              - slack = O(N - n)
Dynamic Array         + compact                               - indirection overhead
                      + dynamic
                      + limited slack
Memory Heap           + dynamic sizes of data                 - O(n) overhead
                      + O(1) store/retrieval via "handle"     - deteriorates with usage
                                                              - no search possible
Linked List           + dynamic                               - O(n) overhead
                      + insert in O(1) time                   - non-compact
                                                              - O(n) searches
Open Hashing          + "O(1)" access time                    - fixed size
                                                              - complex collision handling
Closed Hashing        + "O(1)" access time                    - fixed size
                                                              - buckets with slack/linked list/array
Dynamic Hashing (LH)  + "O(1)" search time                    - accumulated O(log n) insert time
                                                              - complex implementation
Tree                  + dynamic                               - skew gives O(n) in worst case
                                                              - storage overhead n + n/2
                                                              - insert order sensitive
Balanced Tree         + dynamic                               - rearranging cost (dyn. array)
                      + guaranteed O(log n) retrieval
Signature Files       + fast "O(1)" insert                    - slow O(n) search time
                      + approximate                           - no order
                                                              - storage overhead n

Table 1.2: Positive and negative properties of basic data structures.

1.5 Roundup

Table 1.2 displays a list of data structures and what I see as their most positive and negative properties. The list is in no way complete, but it is provided as a summary of the discussions in this chapter.

1.6 LH* (1-dimensional data)

We start out by describing LH* [LNS93], the first full SDDS designed. LH* defined the basics for SDDSs and inspired me and many others to boldly create and explore areas where no man has gone before.

We will now describe the LH* SDDS; later on we describe LH*LH. LH* is a data structure that generalizes Linear Hashing to parallel or distributed RAM and disk files [LNS96]. One benefit of LH* over ordinary LH is that it enables autonomous parallel insertion and access. The number of buckets, and the buckets themselves, can grow gracefully. Insertion requires one message in general and three in the worst case. Retrieval requires at least two messages, possibly three or four. In experiments it has been shown that insertion performance is very close to one message (+3%) and that retrieval performance is very close to two messages (+1%). The main advantage is that no central directory is required for managing the global parameters.

1.6.1 LH* Addressing Scheme

An LH*-client is a process that accesses an LH* file on behalf of the application. An LH*-server at a node stores data of LH* files. An application can use several clients to explore a file. This way of processing increases the throughput, as will be shown in Section 2.6. Both clients and servers can be created dynamically.

[Figure: a data client, holding the image "level = 1, pointer = 0", sends an insert to the data servers; an addressing error is forwarded between servers according to the hash functions, and the correct server returns an IAM to the client.]

Figure 1.1: LH* File Expansion Scheme.

At a server, one bucket per LH* file contains the stored data. The bucket management is described in Section 2.2. The file starts at one server and expands to others when it overloads the buckets already being used.

The global addressing rule in an LH* file is that every key C is inserted at the server s_C, whose address s = 0, 1, ..., N - 1 is given by the following LH addressing algorithm [Lit94]:

s_C := h_i(C);
if s_C < n then s_C := h_{i+1}(C),

where i (the LH* file level) and n (the split pointer address) are file parameters evolving with splits. The h_i functions are basically:

h_i(C) = C mod (2^i x K), K = 1, 2, ...,

and K = 1 in what follows. No client of an LH* file knows the current i and n of the file. Every client has its own image of these values, let them be i' and n'; typically i' <= i [LNS93]. The client sends the query, for example the insert of key C, to the address s'_C(i', n').

The server s'_C verifies upon query reception whether its own address equals s_C, using a short algorithm stated in [LNS93]. If so, the server processes the query. Otherwise, it calculates a forwarding address s''_C using the forwarding algorithm in [LNS93] and sends the query to server s''_C. Server s''_C acts as s'_C did and perhaps resends the query to server s'''_C, as shown for Server 1 in Figure 1.1. It is proven in [LNS93] that s'''_C must then be the correct server. In every case of forwarding, the correct server sends to the client an Image Adjustment Message (IAM) containing the level i of the correct server. Knowing this i and the s_C address, the client adjusts its i' and n' (see [LNS93]) and from then on will send C directly to s_C.
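The client-side address computation and the server-side check can be sketched as follows (a Python simplification of the algorithms in [LNS93], with K = 1; the function names are ours):

```python
def h(i, C):
    """The LH hash functions: h_i(C) = C mod 2^i (with K = 1)."""
    return C % (2 ** i)

def client_address(C, i_img, n_img):
    """A client computes the target server from its possibly outdated
    image (i', n') of the file level and split pointer."""
    a = h(i_img, C)
    if a < n_img:                 # this bucket has already split at level i'
        a = h(i_img + 1, C)
    return a

def server_check(C, a, j):
    """Server a, whose bucket carries level j, either accepts the key
    (returning None) or returns a forwarding address; a sketch of the
    forwarding algorithm in [LNS93]."""
    a_next = h(j, C)
    if a_next == a:
        return None               # correct server; if forwarding occurred,
                                  # an IAM with level j goes back to the client
    a_guess = h(j - 1, C)
    if a < a_guess < a_next:
        a_next = a_guess          # never forward beyond the correct server
    return a_next
```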

1.6.2 LH* File Expansion

An LH* file expands through bucket splits, as shown in Figure 1.1. The next bucket to split is generally noted bucket n; n = 0 in the figure. Each bucket keeps the value of i it uses (called the LH*-bucket level) in its header, starting from i = 0 for bucket 0 when the file is created. Bucket n splits through the replacement of h_i with h_{i+1} for every C it contains. As a result, typically half of its records move to a new bucket N, appended to the file with address n + 2^i. In Figure 1.1, N = 8. After the split, n is set to (n + 1) mod 2^i. The successive values of n can thus be seen as a linear move of a split token through the addresses 0, 0, 1, 0, 1, 2, 3, 0, ..., 2^i - 1, 0, ... The arrows in Figure 1.1 show both the token moves and the new bucket address for every split, as resulting from this scheme.
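A small simulation (illustrative Python) reproduces the token walk and the appended bucket addresses described above:

```python
i, n = 0, 0                          # file level and split pointer
token_path, new_buckets = [], []
for _ in range(15):                  # fifteen successive splits
    token_path.append(n)             # bucket n splits next...
    new_buckets.append(n + 2 ** i)   # ...its movers go to bucket n + 2^i
    n += 1
    if n == 2 ** i:                  # every bucket of this level has split
        n, i = 0, i + 1

print(token_path)    # [0, 0, 1, 0, 1, 2, 3, 0, 1, 2, 3, 4, 5, 6, 7]
print(new_buckets)   # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
```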

Splitting Control Strategies

There are many strategies, called split control strategies, that one can use to decide when a bucket should split [LNS96] [Lit94] [WBW94]. The overall goal is to avoid overloading the file. As no LH* bucket can know the global load, one way to proceed is to fix some threshold S on a bucket [LNS96]. Bucket n splits when it gets an insert and the actual number of objects it stores is at least S. S can be fixed as a file parameter, but a potentially better-performing strategy is to calculate S for bucket n dynamically using the following formula:

S = M x V x (2^i + n) / 2^i,

where i is the n-th LH*-bucket level, M is a file parameter, and V is the bucket capacity in number of objects. Typically one sets M to some value between 0.7 and 0.9.

The intuition behind the formula is as follows. A split to a new server should occur for every M x V global inserts into the data structure, thus aiming at keeping the mean load of the buckets constant:

global number of inserts / number of servers = constant.

A server without any knowledge about the other servers can only use its own information, that is, its bucket number n and the level i, to estimate the global load. It knows that every server < n, i.e., servers 0..n - 1, has split into servers 2^i..2^i + n - 1, and both these groups thus have half the load of the servers that are not yet split, servers n..2^i - 1. The number of servers can be calculated as 2^i + n, which gives us an estimated global load of

M x V x (2^i + n).

Servers that were split, and new servers, have half the load, S/2, of those that are still to split, which have the load S. The n new servers come from n servers, totalling 2 x n servers with the load S/2, and 2^i + n - 2 x n = 2^i - n remaining servers to be split later with a load of S. The total over these servers can then be expressed as

(1/2) x S x 2 x n + S x (2^i - n).

This can be simplified to S x 2^i. Setting the global estimate equal to the last expression provides, after some simplification,

M x V x (2^i + n) = S x 2^i.

Solving for S gives the formula for S expressed above.

The performance analysis in Section 2.6.1 shows indeed that the dynamic strategy is to be preferred in our context. This is the strategy adopted for LH*LH.
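The resulting split decision is a one-liner per insert; a Python sketch with illustrative parameter values (V = 100 objects per bucket, M = 0.8):

```python
def split_threshold(M, V, i, n):
    """Dynamic split threshold S = M * V * (2^i + n) / 2^i for bucket n
    at level i; V is the bucket capacity, M typically 0.7..0.9."""
    return M * V * (2 ** i + n) / 2 ** i

def should_split(bucket_size, M, V, i, n):
    # Bucket n splits when an insert arrives and it already holds >= S objects.
    return bucket_size >= split_threshold(M, V, i, n)

# Example: at level i = 3 with split pointer n = 2,
# S = 0.8 * 100 * (8 + 2) / 8 = 100 objects.
assert abs(split_threshold(0.8, 100, 3, 2) - 100.0) < 1e-9
```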

1.6.3 Conclusion

LH is well known for its scalability in handling a dynamically growing dataset, and the new, distributed LH* has also been proven scalable. Both of these hashing algorithms use the actual bit representation of the hash values; these are given by the keys. Hashing in general can be seen as a radix sort over an interval where each value has a bucket in which it stores the items. LH can in turn be viewed as a radix sort using the lower bits of the hash value of the keys. It furthermore has an extra attribute that tells us the number of bits used, and a splitting pointer. The splitting pointer allows gradual growth and shrinkage of the range of values (number of buckets) used for the radix sort.

LH* is a variant of LH that enables simultaneous access from several clients to data stored on several server nodes. One LH bucket corresponds to the data stored on a server node. In spite of not having a central directory, the LH* algorithm allows for extremely fast updates of a client's view, so that it will access the right server nodes when inserting and retrieving data. LH* [LNS93] was one of the first Scalable Distributed Data Structures (SDDSs). It generalizes LH [Lit80] to files distributed over any number of sites. One benefit of LH* over LH is that it enables autonomous parallel insertion and access. Whereas the number of buckets in LH changes gracefully, LH* lets the number of distribution sites change as gracefully. Any number of clients can be used; the access network is the only limitation to linear scale-up of the capacity with the number of servers, for hashed access. In general, insertion requires one message, and in the worst case three messages. Retrieval requires one more message. But the main point is that no central directory is needed for access to the data.

1.7 Orthogonal Aspects

In this section we list important properties of the data structures studied in this thesis. These properties should ultimately be independently available for data storage; in practice this is not the case. For example, distribution or parallelism gives better performance but generally decreases the availability. More dimensions give more overhead and/or worse performance.

1.7.1 Performance

It is desired that single lookup/insert operations can be performed in "constant time"; in practice, however, O(log n) usually suffices. When one or more parameters are varied for a data structure, such as dimensions, distribution, availability, or communication topologies, they will inevitably affect performance. Disk I/O, as well as cache-misses in RAM, should be avoided. In some cases the actual CPU cycles may be of importance in common operations, such as scanning arrays of data.

1.7.2 Dimensions

For classic data structures, only one-dimensional data is allowed. That is, one key is used for retrieval and inserts.

Two dimensions are also fairly well covered by the literature. Many structures combine the x and y values into one value, and use this value for indexing in a classic one-dimensional data structure. Some structures are based on order-preserving hashing, interleaving the x and y binary representations to form another value, later used either for indexing a one-dimensional hash-file (referred to as multi-dimensional hashing) or for building quad-tree style structures; a sketch of this bit-interleaving follows below. Common operations on spatial (2-dimensional) data structures involve point lookup, region retrieval, closest neighbors, or similarity retrieval [Sam89].

It is a well-known fact that most multi-dimensional data structures suffer from the multi-dimensional curse [WSB98]: the performance degrades by many orders of magnitude when the dimensionality increases. For similarity retrieval it has been observed [WSB98] that it is better to perform scanning over the whole dataset, or to use a compact signature file, than to try to use a multi-dimensional data structure. It was shown that their scanning method already at 13 dimensions outperformed known efficient multi-dimensional data structures.
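A sketch of the bit-interleaving, often called a Morton or z-order key (illustrative Python, not from the thesis):

```python
def interleave(x: int, y: int, bits: int = 16) -> int:
    """Interleave the binary representations of x and y (Morton / z-order),
    producing one key for a one-dimensional structure."""
    z = 0
    for b in range(bits):
        z |= ((x >> b) & 1) << (2 * b)       # x supplies the even bit positions
        z |= ((y >> b) & 1) << (2 * b + 1)   # y supplies the odd bit positions
    return z

# Nearby points tend to get nearby keys, so the one-dimensional index
# preserves some two-dimensional locality:
assert interleave(0b11, 0b00) == 0b0101
assert interleave(0b00, 0b11) == 0b1010
```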

1.7.3 Overhead

Data structures allow for efficient indexing, in order to accelerate retrieval. However, the storage overhead is not negligible in most indices: B-trees may store duplicate keys in the internal nodes, and hashing structures often have some slack to avoid worst cases. The storage overhead (unused space) is typically around 50% for a B-tree and 10-15% for a hash structure. Performance is affected even more by the CPU usage for navigating the index, or by the cost of calculating hash-keys. The compactness and locality of the memory navigation is another important concern on CPUs with large internal caches [BMK99].

1.7.4 Distribution and Parallelism

To handle very large amounts of data, distribution or parallelism is traditionally employed. Distribution of data, however, adds additional storage overhead, often more calculations, as well as communication messaging. Parallelism using shared memory is limited by hardware-architecture scale-up limits, but it avoids costly messaging by using other means of synchronization. The more data stored at more sites, the more messages and the more overhead in accessing and organizing the data, as well as in processing it.

B-trees and hash structures are preferred for creating distributed indices. They can be used to administer and automatically decluster the data set over a number of nodes. However, they are often static in their structure, allowing only limited load balancing, and they perform poorly when the presumptions change.

SDDSs are, in a way, the "balancing" distributed data structures. They generally allow for retrievals in "near constant" time, O(1)..O(log n), or rather in a "near constant" number of messages on average. The performance of SDDSs is often assessed by simulations that count the number of serial messages needed for data to be found or inserted. For example, LH* [LNS93] has been reported to allow retrieval of data distributed over hundreds of nodes in less than 2.001 messages on average [LNS93]. Furthermore, LH* limits the number of messages needed for retrieval to 4. Other SDDSs do not guarantee an upper bound, but instead offer acceptable average performance.

1.7.5 Availability

Availability means that data can be made available when it is needed. It may involve reconstruction of the actual data by applying logs, or by combining partial replicas.

In real-time systems, data structures are designed to give guaranteed performance, bounded both in time and in availability. Many "dynamic" structures are more vague, quoting average performance values. Disk-based data may be cached in main memory or require random disk accesses, and may be delayed because of other users' accesses to the same disk.

In the event of distributed systems, the task is even more difficult. Storage nodes may be unavailable at times, because of hardware or software faults, or network congestion. Some specialized networks are designed to be able to give promises about performance (ATM). Common solutions for achieving high availability use techniques such as RAID storage [PGK88], replication, logging and hot standby, and failure recovery [Tor95].