
Range Queries over Hashing

Clarissa Bruno Tuxen, Fluminense Federal University, [email protected]
Rayane de Araujo, Fluminense Federal Institute, [email protected]
Ioan Raicu, Illinois Institute of Technology, [email protected]

ABSTRACT

The present work proposes the investigation of the most efficient way of implementing range queries over hashing in a Distributed Hash Table (DHT) environment. Currently, some DHTs do support such queries; however, in most cases it is done inefficiently. One common approach uses brute force, individually querying each discrete value within the range, which is not practical and can cause a loss of performance when compared to systems that perform the task efficiently. Conversely, there are structures that work well for range searches due to their internal properties, such as Red-Black trees, but they do not provide some desired DHT functionalities. When working with fast-changing attributes and continuous values, it becomes even harder to manage the system's performance in DHT environments, which intend to distribute the keys uniformly using hashing. In this scenario, this research intends to find a suitable approach to fulfill this need, as well as to reproduce it and evaluate its performance against the approaches that favor each side, range querying and load balancing, so that such systems can achieve better performance in terms of number of lookups and time.

Keywords
DHT; PHT; range query; hashing

1. INTRODUCTION AND BACKGROUND

DHT systems do not properly support range queries themselves. A simple but non-optimal manner of performing this kind of query consists of individually querying each discrete value within a range. This approach is not practical and can cause performance issues that would, consequently, prevent the system from working as expected. When working with fast-changing attributes and continuous values, it becomes even harder to manage the system's performance in this scenario, especially regarding DHTs that intend to distribute the keys uniformly using hashing [13]. In addition, range searches would have a better response time if each discrete value in the range were not queried separately, especially if the range is sparse, meaning that there are only a few keys within the range and most of the searches would not return objects.

Zero-Hop Distributed Hash Table (ZHT) is a DHT-based data structure which has been adjusted to handle High-End Computing [11]. Based on a persistent NoSQL storage system [7], it aims at providing an available, fault-tolerant, and efficient structure for systems that work with a great number of nodes. ZHT intends to merge the advantages of DHTs with efficiency in terms of latency and throughput. However, it currently does not provide support for performing range queries, making it susceptible to the issues presented above.

Based on these problems, this project has the goal of discovering an efficient approach to support range queries in a ZHT-like environment. As previously discussed, the range query approach that some DHT systems currently implement is infeasible in most cases [2]. Therefore, it is the intention of this research to find an approach that avoids querying each value within the range, using a data structure that provides efficient range search over hashing, since the hashing properties are desirable. Implementing such an approach would lead ZHT to a faster and more efficient solution for this task. Since the ZHT storage system is based on hashing, studying how to deal with range queries over hashing becomes an important aspect of implementing such queries in this context.

A DHT is a system that allows data to be distributed across nodes and easily retrieved in a scalable manner [5]. It can be seen as a large-scale hash table spread over multiple machines. More specifically, a hash table is a data structure that stores key-value pairs distributed across the table. It does so by applying a hash function to the unique key in order to define in which entry of the table the pair will be stored. Hash tables provide O(1) inserts, deletes, and lookups of keys, since these operations all first hash the key and then apply the desired operation. Assuming the hash function is efficient, this structure offers uniform distribution of objects in the storage space. However, when retrieving sequential values in a range, such a structure is inefficient precisely because of this distribution: the proximity of similar items in the key domain cannot be exploited in the hashed key space, since the objects were uniformly distributed across the storage space. This is nonetheless a good property to keep, as it provides load balancing.

In DHTs, instead of different entries there are different nodes, and it becomes clearer why a good distribution is desired: it provides load balancing of keys across nodes and decreases overheads, avoiding bottlenecks. Even though this data structure is widely used as a building block for several distributed applications, it is limited to efficiently supporting only exact-match queries, for the aforementioned reasons. Range queries intend to retrieve all objects within a certain range of keys, making an efficient solution for this problem non-trivial in DHTs.

As for data structures that support range queries, Red-Black trees provide O(log n) insert, delete, and lookup of keys [6]. The Red-Black tree has five properties that may require some maintenance when inserting or removing elements, but it still keeps the complexity mentioned above, where n is the number of elements stored in the data structure. When comparing Red-Black trees and hash tables, these operations differ in two aspects that concern this work. The former searches its objects sequentially, starting from the leftmost leaf until the rightmost leaf is reached, so it can retrieve all elements within a given range regardless of the sparsity of the values. This means that, for a given range, having few stored objects that belong to the range does not affect the tree traversal, since the tree stores the values in order and, during the search, the visited nodes' values are checked against the range. The same cannot be guaranteed for hash tables. As mentioned before, hash tables do not map the proximity of similar items in the key domain to the hashed domain, so the range search needs to individually look up each possible key in the range. This adds constraints to the key domain, which needs to allow discretization, as well as additional overhead that may change with the range density.

The second differing aspect is the operations' complexity. While the hash table provides constant time for the basic operations (insert/delete/lookup), the tree's complexity depends on the number of elements stored, which could lead to a potential scalability issue. However, for range queries, the hash table's complexity varies according to the size of the range, while the tree's complexity varies according to the range density.

In this work, range density is based on the number of elements contained in the range. It is related to the number of hits and misses that occur when querying a given range. As an example, in the domain [1, 3, 5, 10, 11, 12, 13, 14, 15], the density of the range [1, 5] is three, while the density of [10, 15] is six. Thus, the second range is denser than the first one, since there were more hits.

As for distribution of objects, both structures provide it. Hash tables use their hash function to uniformly distribute the keys among the storage space, while Red-Black trees are balanced, with their properties assuring it.

Given the usability of DHTs, their mapping to hash tables, and the desired properties found in hash tables and Red-Black trees, this work intends to study a data structure that can provide efficient range querying without losing the good performance for the basic operations.

2. PROPOSED SOLUTION

As mentioned in Section 1, the intent of this work is to research the possibility of efficiently implementing range queries in a hashed structure, maintaining the load balancing property while adding support for such queries.

The initial goal consisted of searching for and analysing existing solutions so that, based on them, improved approaches could be discussed and a better solution could be proposed. However, due to time constraints, the work focused on implementing one existing approach, evaluating its performance, and comparing it against the two structures (hash tables and Red-Black trees) that perform well for each case mentioned in Section 1, with the nature of the work being real system development.

Since a DHT can be seen as a multi-computer hash table, as mentioned in Section 1, a single-node implementation of a system using PHT over a hash table was evaluated in order to simulate such an approach. Since PHT is not an open-source project, the approach was implemented following the properties described in [13]. It was implemented in Java. The HashMap and TreeMap Java classes were used to represent the hash table and the Red-Black tree, respectively. The Pair class in javafx.util.Pair is the only dependency outside of the java.util package.

Among the approaches found, PHT [13] was selected because it could be suitable for integration with the ZHT structure. This approach suggests the use of an overlay network over the DHT layer. A good aspect of it is that it can be applied over any DHT, since it does not require any alteration to the DHT hashing function. The PHT approach uses a binary trie that maintains the distance property between its elements by using prefixes. Because of that, the number of lookups is minimized and the range query process is simplified and optimized. A PHT overview can be found in Section 2.1.

Alongside PHT, the naive brute-force approach was implemented using a hash table, providing range queries by looking up each discrete key value within the given range. As discussed earlier, even though this approach is not efficient, it is the most intuitive way of performing such queries in a hashed data structure.

Finally, the last approach considered was a Red-Black tree, which is a balanced binary tree that stores its elements in sorted order. It was implemented in order to provide a comparison with a structure that can efficiently retrieve range values.

A more detailed description of PHT is presented in the following section, as well as an explanation of how the range queries are performed.

2.1 Prefix Hash Tree

PHT is a resilient and efficient trie-based distributed data structure that implements more complex queries over a DHT [13]. It is an overlay network that constructs a trie-based structure over a DHT and takes advantage of its lookup interface. Since PHT does not require any changes in the DHT layer, it can be used with any DHT.

PHT is essentially a binary trie built over the DHT. Every PHT node has a label (its prefix), and nodes can be either internal nodes or leaves, as in regular trees. Only the leaf nodes store the key-value objects, in an internal data structure called the bucket. The buckets have a maximum capacity, the block size. An additional characteristic of leaves is that they are part of a doubly linked list.

The PHT properties try to keep the trie as concise as possible. Nodes are created as leaves. During an insert, they may become internal nodes if their bucket's capacity (the block size) is exceeded, in which case they need to create children and distribute their key-value pairs among them. This is the split property. Similarly, during a delete, a merge operation may be required. The five properties are described in [13].

The way PHT works over a DHT is by considering the PHT nodes to be the objects stored in the DHT. The key is the PHT node's label, and the value is the PHT node itself. It is thus important to highlight that the PHT nodes are different from the DHT nodes: while the former hold the actual information, the latter store the PHT nodes. Figure 1 illustrates this.

Figure 1: PHT representation. Each PHT node is one object to be stored in the DHT. The DHT key is the PHT node label, and the DHT value is the PHT node itself.

In this work, the PHT approach was implemented following its properties. In order to do so, the main operations of PHT (lookup, insert, split, delete, merge, and range query) were reproduced. A description of each of them is provided as follows.

2.1.1 PHT basic operations

The basic operations that need to be supported include the ones that provide the basic functionality (insert, delete, lookup) and the ones that maintain the PHT properties (merge and split).

The lookup operation intends to retrieve the value of a given key. It can be seen as containing two steps: finding the PHT node (PHT-lookup) by retrieving it from the DHT (DHT-lookup), and then retrieving the value of the desired key. The first step consists of trying different prefixes and using them as the key for the DHT-lookup. It can be performed either linearly or as a binary search. If the DHT-lookup returns a PHT leaf node, this step is finished; otherwise, the next prefix is decided based on the algorithm (linear or binary search) and used as the DHT key. In the second step, the PHT node's bucket is traversed in order to retrieve the value represented by the desired key. The present work only reproduced the linear lookup, due to time constraints and the fact that the focus of this research is the range query itself.

For the insert operation, a DHT-lookup is done to retrieve the PHT node in the same way as the first step of the lookup operation, retrieving a leaf node. If the bucket of this node is already full when inserting a new value, a split operation may become necessary.

A split operation is only performed when the PHT node's bucket is full when a new object is inserted. In order to keep the PHT's properties, the overloaded PHT node must create two children (because of one of the properties) and split its objects among them. It is interesting to notice that this may trigger further splits, since it is possible for all objects to go to the same child, making it overloaded and requiring another split.

The delete process is also straightforward. A DHT-lookup is done in order to find the PHT leaf node that contains the given value, once again by the same process as the first step of the lookup and the insert. After the PHT node is retrieved, the object with the given key is removed from the bucket, and therefore from the entire system. In this process, the inverse problem of the insert can happen: it may be necessary to merge nodes into an ancestor.

The merge operation happens in order to maintain another property, which states that there is a minimum number of keys to be stored in each internal node's sub-tree. This number is related to the block size (bucket size); more precisely, it is block size + 1. This operation ensures that each internal node has no fewer than two children, and it also helps the trie stay concise.

2.1.2 PHT range operations

Regarding the range query process, [13] proposes two ways of doing it. The sequential algorithm performs DHT-lookups for the PHT leaf node labeled with the lower (or upper) bound prefix. Once such a node is found, all keys in its bucket are checked against the range, and those that belong within the range have their values retrieved. While the upper (or lower) bound is not reached, the leaves are traversed through the linked list and the check of the keys in the buckets is performed. The entire operation requires one PHT lookup at the beginning, plus the cost of traversing the leaves that contain keys within the range.

In the parallel range query, DHT-lookups are also performed, in order to retrieve the node whose label is the smallest prefix that covers the whole range. If it returns an internal node, the search is forwarded to the children that overlap the range until leaves are reached. Once this happens, the buckets are checked and the values whose keys belong to the range are retrieved. If it is a leaf, only the final step is necessary.

Figure 2: PHT range query algorithms representation. The sequential algorithm finds the leaf with the prefix of the lower bound and traverses the linked list of leaves until the leaf containing the upper bound is found. The parallel algorithm finds the smallest prefix that matches both boundaries of the range and, from this node, looks for leaves matching the range in parallel. In the leaves, the buckets are checked and the keys within the range have their values retrieved.

3. EVALUATION

This section presents a single-node evaluation of the PHT data structure. A comparison between this work's implementation and the simulation in [13] is shown. Also, in order to evaluate PHT's time performance, a comparison between PHT, hash tables, and Red-Black trees is performed. Two different datasets were used in the experiments, synthetic and network trace. The following sections detail the environment setup and the datasets.

3.1 Testbeds and Benchmark details

As mentioned in Section 2, all three data structures were implemented in Java. The hash table and Red-Black tree used Java's HashMap and TreeMap classes, so only an interface was implemented over these classes' inherent operations.

The experiments were run on Ubuntu 16.04 LTS with an AMD Phenom II X6 1100T processor and 16 GB of RAM. The implementations are single-threaded and single-node, exploiting the fact that the objects can be kept in memory.

3.2 Synthetic dataset

For this dataset, the experiment setup consists of generating 2^16 30-bit keys from a uniform distribution over a key space of 2^30. The PHT block size B is set to 20. A Java HashMap is used as the hash table layer in which the PHT nodes are stored. The focus of this experiment is validating this work's implementation against the results presented in [13].

The metric used for evaluating this dataset is based on the optimum number of traversed leaves [13]. The optimum value (O) can be obtained from the size of the result set (R) and the block size (B). It can be described as O = ⌈R/B⌉. This represents the ideal case, in which the keys are well distributed across leaf nodes and only the minimal number of leaves needs to be traversed.

3.3 Network trace dataset

The network trace dataset was obtained from "The industrial cyber security conference" (4SICS 2015) [1], and it contains the network trace of the ICS lab. The trace spans approximately 7 hours, from approximately 10 am until around 5 pm, with a total of 239,267 entries. The distribution of the entries per hour is shown in Figure 3.

Figure 3: The network trace hourly package distribution.

The experiments run with this dataset were intended to investigate the lookup performance difference between PHT and the hash table, and the time performance difference between the three data structures.
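To make the PHT structure and operations of Section 2.1 concrete, the following is a minimal, single-node Java sketch in the spirit of the paper's setup (a HashMap standing in for the DHT layer). It is illustrative rather than the authors' code: all names (PhtNode, Pht, lookupLeaf, and so on) are hypothetical, the key width and block size are toy values, and it covers only insert with splitting and the sequential range query, omitting delete/merge and the parallel algorithm.

```java
import java.util.*;

// Illustrative single-node PHT sketch; names and details are assumptions,
// not the paper's actual implementation.
class PhtNode {
    final String label;                             // binary prefix label, e.g. "010"
    boolean leaf = true;                            // nodes are created as leaves
    Map<Integer, String> bucket = new HashMap<>();  // key-value pairs (leaves only)
    PhtNode prev, next;                             // doubly linked list of leaves

    PhtNode(String label) { this.label = label; }
}

class Pht {
    static final int BITS = 8;  // key width (the paper's experiments use 30 bits)
    static final int B = 4;     // block size, i.e. bucket capacity (20 in the paper)
    final Map<String, PhtNode> dht = new HashMap<>(); // the "DHT": label -> PHT node

    Pht() { dht.put("", new PhtNode("")); }           // the root starts as a leaf

    static String bits(int key) {                     // fixed-width binary form of a key
        String s = Integer.toBinaryString(key);
        return "0".repeat(BITS - s.length()) + s;
    }

    // Linear PHT-lookup: try successively longer prefixes of the key as DHT keys
    // until the DHT-lookup returns a leaf.
    PhtNode lookupLeaf(int key) {
        String k = bits(key);
        for (int i = 0; i <= BITS; i++) {
            PhtNode n = dht.get(k.substring(0, i));   // one DHT-lookup per prefix tried
            if (n != null && n.leaf) return n;
        }
        throw new IllegalStateException("no leaf found for key " + key);
    }

    void insert(int key, String value) {
        PhtNode leaf = lookupLeaf(key);
        leaf.bucket.put(key, value);
        while (leaf.bucket.size() > B) leaf = split(leaf); // splits may cascade
    }

    // Split property: an overfull leaf becomes internal and pushes its pairs
    // down to two freshly created children, partitioned by the next bit.
    PhtNode split(PhtNode node) {
        PhtNode zero = new PhtNode(node.label + "0");
        PhtNode one  = new PhtNode(node.label + "1");
        int bit = node.label.length();
        for (Map.Entry<Integer, String> e : node.bucket.entrySet())
            (bits(e.getKey()).charAt(bit) == '0' ? zero : one)
                .bucket.put(e.getKey(), e.getValue());
        zero.prev = node.prev; zero.next = one;          // splice both children into
        one.prev = zero;       one.next = node.next;     // the doubly linked leaf list
        if (node.prev != null) node.prev.next = zero;
        if (node.next != null) node.next.prev = one;
        node.leaf = false;
        node.bucket = Collections.emptyMap();
        dht.put(zero.label, zero);
        dht.put(one.label, one);
        // If every key went to one child, that child is still overfull.
        return zero.bucket.size() >= one.bucket.size() ? zero : one;
    }

    // Smallest key that can live under this node's prefix (label padded with 0s).
    static int minKey(PhtNode n) {
        return Integer.parseInt(n.label + "0".repeat(BITS - n.label.length()), 2);
    }

    // Sequential range query: one PHT lookup for the lower bound's leaf, then
    // walk the leaf list, checking each bucket against the range.
    List<String> rangeQuery(int lo, int hi) {
        List<String> out = new ArrayList<>();
        for (PhtNode n = lookupLeaf(lo); n != null && minKey(n) <= hi; n = n.next)
            for (Map.Entry<Integer, String> e : n.bucket.entrySet())
                if (e.getKey() >= lo && e.getKey() <= hi) out.add(e.getValue());
        return out;
    }
}
```

In a real deployment the dht map would be replaced by actual DHT put/get calls, and a merge operation would mirror split on deletes; the sketch only shows how prefix labels double as DHT keys and how the leaf list supports the sequential range walk.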

Figure 4: The sub-optimal factor of the leaves traversed given the range size fraction for the synthetic dataset. The average sub-optimal factor is 1.417.

Figure 5: The sub-optimal factor of the leaves traversed given the range size fraction for the network trace dataset. The average sub-optimal factor is 1.427.

In the time performance analysis, an average of three runs was considered, and three different workloads were used in order to evaluate the impact of the range density. The workloads do not impact the lookup performance, since it is mostly based on the range size rather than the result set size.

3.4 Workload definition

Different workloads are used in order to explore different range densities. The range density is based on the result set size; it is related to the number of hits and misses that happen when a range query is issued. More hits indicate a denser range, while more misses indicate a sparser range.

The workloads are defined by the number of objects inserted in the data structure, which is intended to induce different range densities for each workload. The three workloads had their objects selected from the network trace dataset, with a random choice of entries coming from a uniform distribution. They are defined as follows.

• The first workload inserts 100% of the objects, leading to denser ranges;

• The second uses 60% of the data, with intermediary range density;

• The third considers 20% of the objects, leading to sparser ranges.

A distribution of the result set size per range size for each workload is shown in Figure 7. Points with a higher result set size indicate denser ranges, since more hits happened. Oppositely, when the result set size is smaller, the range is sparser, indicating that fewer hits (more misses) happen when querying it. As a general trend, the 100% workload tends to be denser given its higher values, while the 20% workload is sparser, since it generally has result sets with fewer values. The 60% workload presents an intermediary density, with its values between the two more extreme workloads. This corroborates the assumption that inserting different amounts of elements leads to different range densities.

3.5 Range Query Evaluation

All range query experiments used 1000 randomly generated queries. For the synthetic data, the ranges varied from 2^22 to 2^26. For the network trace data, the ranges varied from 30 seconds to 1 minute. All random values come from a uniform distribution.

The first experiment used the synthetic data, as explained in Section 3.2. Figure 4 shows the number of leaves actually traversed in PHT, normalized by the optimum (minimum) value; that is, the sub-optimal factor for a given range size. The x-axis represents the range sizes normalized into a (0, 1) scale. Even though there is some variation in the values, the average (1.417) is comparable to the average of the values obtained by the simulation in [13] (about 1.4). The spikes in the chart suggest that the split operation interferes with the actual number of leaves traversed.

When the same analysis is performed with the network trace dataset, Figure 5 indicates that, on average, the sub-optimal factor is roughly the same, 1.427 for this dataset. The spikes are attributed to the fact that some queries traverse leaves that have empty buckets. This happens because the split property of PHT is being followed: after a split operation, it is possible that all keys are assigned to the same child, leaving the other child with an empty bucket and requiring a new split. If this situation repeats, many leaves with empty buckets will be created, increasing the number of leaves that need to be traversed for a range that starts before such leaves and ends after them.

Figure 6 shows the number of DHT-lookups under the denser range (100% workload). The number of DHT-lookups performed by the hash table is considerably worse than PHT's. The density of the range does not affect the number of lookups of the hash table, whereas it can increase them in PHT. Still, PHT performs better than the hash table by a large margin.

The time performance analysis uses the three aforementioned workloads. Given the range density difference between workloads, the analysis presented in Figure 8 compares the total time of range query operations between the three data structures. It can be observed that the hash table's performance is considerably worse than the others', even though it is consistent across range densities. This can be attributed to the fact that the hash table's range query complexity is based on the range size rather than the result set size, which influences the density.

As for PHT, there is almost no variation, with marginally better results for the intermediary workload. A factor that could influence these results is the split property: with sparser ranges, the number of leaves with empty buckets could be higher and add some overhead to the operation. However, overall, PHT clearly outperforms the other two data structures.

The Red-Black tree has results affected by the density, with the best time for the sparser workload. This is aligned with the structure's properties, since the sparser workload results in smaller result sets. The Red-Black tree range query complexity was said in Section 1 to depend on the result set size, which is confirmed here.

As an overall comparison, given the intermediary workload, Figure 9 presents the times for the fastest query, the slowest query, the average, and the total time of the 1000 range query operations. It is clear that PHT has the best performance. The variation between the fastest and slowest queries for the hash table is small, with both presenting a large value. As for PHT and the Red-Black tree, both have low times for the fastest query, however they differ in the slowest query (2 milliseconds for PHT and 47 milliseconds for the Red-Black tree). The tree has a larger variation, which indicates that PHT is overall faster, as can be seen in the total time. Note that the average PHT time is not negative; the y-axis is on a log scale.

Figure 10 shows the time performance with three range size variations for the intermediary workload (60%). The range sizes vary by at most 30 seconds, 15 seconds, and 7.5 seconds. PHT and the Red-Black tree have consistent times, with PHT clearly outperforming both other data structures. The hash table presents some improvement as the range size decreases, which supports the initial assumption of the hash table's time complexity being based on the range size.

10,000,000

1,000,000

100,000

10,000 Numberlookups of

1,000

100

10

1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Range fraction

PHT Hash table

Figure 6: Comparison between the hash table and PHT regarding the number of lookups performed when a range query is issued. The x-axis represents the range sizes normalized in a (0, 1) scale.PHT has an average of 16 lookups while the hash table’s is 44, 841, 724 lookups.

Figure 7: Result set size distribution by range size. Higher values indicate denser ranges, since there are more hits within the given range. The opposite holds for the lower values, where fewer objects are found, making the range sparser. Based on this, the 100% workload has denser ranges, the 60% workload has intermediary ones, and the 20% workload has sparser ranges.

1,000,000

100,000

10,000

Time Time (milliseconds) 1,000

100

10

1 100% 60% 20% Workload

PHT Tree Hash table

Figure 8: Total time performance of PHT, hash table, and Red-Black tree. PHT outperforms both data structures showing a consistent time across workloads, marginally better in the intermediary workload. The hash table presents the worse results, however consistent, since its range query time complexity depends on the range size rather than result set size. Oppositely, the Red-Black tree has the most affect values which also consists with its complexity being dependent on the result set size.

Figure 9: Time performance for PHT, hash table, and Red-Black tree with the intermediary workload. PHT outperforms the other data structures in all time measurements.

Figure 10, alongside Figure 8, displays the different time complexity dependencies of the Red-Black tree and the hash table: the former shows the Red-Black tree's dependency on the range density, while the latter illustrates the hash table's dependency on the range size.

4. RELATED WORK

The presented issue has not been extensively discussed previously, therefore there are not many approaches proposed for it. During the investigation, a few were found, and they are presented below.

In [2], the implementation of range queries in a Peer-to-Peer (P2P) CAN-based network is discussed. In addition, the authors are concerned about the problems that some DHTs present when performing such queries by individually querying each discrete value within the range. To solve those problems, the paper presents the usage of a subset of servers that act as nodes in this network. Those servers act as Interval Keepers (IKs), store [attribute-value, resource-id] pairs, and are responsible for a sub-interval of values within the range. They use Space Filling Curves and the Hilbert Function [3] to map each sub-interval within the IK and its corresponding zone in the dimensional space. In this way, the zone is divided into sub-zones of equal size. The range query is routed to the middle point, and this node recursively propagates the query to its neighbors. This process is called Flooding, and they present three strategies to implement it: Brute Force, Controlled Flooding, and Directed Controlled Flooding. Both PHT and this approach split the domain; however, PHT has an overlay layer over the DHT, which simplifies the approach and makes it more versatile, since it does not require changes in the DHT layer.

In [4], the Skip graph is presented. It is a data structure based on skip lists that is meant to handle complex queries. It is tolerant to node failure and preserves the order of the keys. [12] combines Skip Graphs with a traditional trie. The basic idea of this approach is to take advantage of Skip Graphs for efficient routing, while using the trie to preserve locality. Another application of Skip Graphs can be found in [9], where the authors propose a structure that, based on a skip tree, supports aggregation queries and outperforms skip graphs for range and exact-match queries. However, even though Skip Graphs seem to be a good approach for complex queries, they do not provide load balancing. In addition, this work has the intention of being integrated with ZHT, which makes the Skip Graph approach unsuitable, since it is a different data structure itself and does not provide load balancing, unlike PHT.

B-Trees can also support complex queries. In this structure, all leaf blocks are at the same tree level, and data is only added to the leaf nodes. In addition, B-trees are not sensitive to clustered data. [10] proposes a P2P overlay network based on a balanced tree structure. The authors claim it to be fault tolerant, load balanced, and efficient regarding costs for update operations. Although B-trees can be used to perform range queries, for the goal of this work they do not seem to be a good option: accessing any block requires an initial tree traversal, which can increase latency when performing range queries. In addition, it does not support concurrency for some operations and, similarly to [4], it proposes the usage of a new data structure, the B-tree itself. Unlike PHT, this approach does not propose a tree structure over hashing, which makes further integration with ZHT impractical.

Finally, in [8] the authors also use tries in order to improve the range query process. They propose the usage of "in-network indexing" in an overlay network to make it efficient for high-level predicates. Instead of using uniform hashing, this work proposes using an order-preserving hash function to optimize the process. In this approach, the P-Grid DHT is used in order to provide the aforementioned properties. While this structure maintains the structural order of the data, it may have skewed data, compromising the load-balancing property obtained when a uniform distribution of data is used. They claim to offer a logarithmic search regarding the number of messages for exact-match and range queries. However, such an approach does not seem highly available if the workload is excessive, due to the imbalanced distribution of data and potential congestion.

5. CONCLUSION

During the time of this research, an investigation of the most efficient way of performing range queries in a DHT environment was done. For this purpose, a better understanding of DHTs, as well as an improvement of general research and teamwork skills, was provided to the students involved. A basic study of ZHT and NoVoHT was also done by the group members, with the goal of better understanding their functionality and storage systems, so that the range query approach chosen would match ZHT's needs.

The evaluation presented intended to compare PHT's performance against data structures that perform well for the two different query cases: the exact match (lookup) and the range query. Hash tables have O(1) time complexity for the basic operations (insert, delete, and lookup), and the analysis predicts their range query time complexity to depend on the range size rather than the result set size, so they are expected to perform badly for range queries. Red-Black trees have O(log n) time complexity for the basic operations, with n as the number of elements in the data structure. Regarding range queries, their complexity is expected to depend on the result set size (range density) rather than the range size, so their performance is predicted to vary as the number of elements retrieved varies, regardless of the range size.

It was also intended to evaluate the data structures' performance regarding range density, since it was possible that they could have comparable performance in a denser range: hash tables have to query the entire range, while Red-Black trees' performance depends on the result set size (range density), so experiments with denser ranges could provide similar results for these structures. However, this was not observed, since the tree outperformed the hash table and PHT outperformed both.

Based on this work's single-node evaluation, PHT has shown the best performance regardless of the range density or size, outperforming both data structures by a large margin. It has proven to be effective and consistent as the range density varies, unlike the other two data structures. The hash table has similar results across range densities, while the Red-Black tree shows an improvement as the range becomes sparser, but it is still worse than PHT.

As mentioned before, this work had the goal of investigating the most efficient way of performing range queries in a DHT environment. However, the presented results were ob-

Figure 10: Range query time performance with different range size variation.

1,000,000

100,000

10,000

Time Time (milliseconds) 1,000

100

10

1 30 15 7.5 Range size variation (seconds)

PHT Tree Hash table

Figure 10: Time performance for the intermediary workload (60%) with three different range sizes variations. It is shown that the Red-Black tree time is consistent while the hash table is sensitive to the range size since as the range size decreases, the hash table performance improves. PHT presents the best performance in the three cases while showing consistency regardless of the range size. tained from a single node implementation. As future work, 6. REFERENCES this data structure could be extended to a distributed set- [1] 4sics - 2016 - home. ting in which the evaluation and comparison against the [2] A. Andrzejak and Zhichen Xu. Scalable, efficient range results presented in this work could be performed. Once queries for grid information services. pages 33–40. the system becomes distributed, the integration with ZHT IEEE Comput. Soc. seems worth investigating. It would be interesting to eval- uate if ZHT would be able to support range queries with [3] T. Asano, D. Ranjan, T. Roos, E. Welzl, and good performance using PHT. Another possible investiga- P. Widmayer. Space filling curves and their use in the tion is regarding the range query algorithm used in PHT. design of geometric data structures. In [13] proposes a sequential and a parallel search algorithms. R. Baeza-Yates, E. Goles, and P. V. Poblete, editors, In this work, only the sequential algorithm was used in the LATIN ’95: Theoretical Informatics, volume 911, experiments. pages 36–48. Springer Berlin Heidelberg. As another possible improvement that could be investi- [4] J. Aspnes and G. Shah. Skip graphs. 3(4):37–es. gated, the lookup algorithm could have the data keys stored [5] H. Balakrishnan, M. F. Kaashoek, D. Karger, in the DHT. This would decrease the amount of DHT- R. Morris, and I. Stoica. Looking up data in p2p lookups, since a single DHT-lookup for the data key would systems. 46(2):43. be required and it would point to the PHT-node that holds [6] R. Bayer. 
Symmetric binary b-trees: Data structure such key. Even though the lookup time is decreased, the and maintenance algorithms. 1(4):290–306. maintenance time will potentially increase once every time [7] K. Brandstatter, T. Li, X. Zhou, and I. Raicu. a split or merge operation happens, some keys would be NoVoHT: a lightweight dynamic persistent NoSQL moved to different PHT nodes and the hash table value for key/value store. these keys would need to be updated. An alternative can be [8] A. Datta, M. Hauswirth, R. John, R. Schmidt, and postponing the DHT updates until the lookup for the key K. Aberer. Range queries in trie-structured overlays. is performed and the information stored in the DHT is out- pages 57–66. IEEE. dated. When this happens, the regular PHT lookup can be [9] A. Gonz´alez-Beltr´an, P. Milligan, and P. Sage. Range performed and once the key is found its value (held by the queries over skip tree graphs. 31(2):358–374. PHT-node) can be updated in the hash table. It is expected [10] H. V. Jagadish, B. C. Ooi, and Q. H. Vu. BATON: A that even with the selective update, the lookup time will balanced tree structure for peer-to-peer networks. In decrease in average. Proceedings of the 31st International Conference on Very Large Data Bases, VLDB ’05, pages 661–672. VLDB Endowment. [11] T. Li, R. Verma, X. Duan, H. Jin, and I. Raicu. Exploring distributed hash tables in HighEnd computing. 39(3):128. [12] L. Meifang, Z. Hongkai, S. Derong, N. Tiezheng, K. Yue, and Y. Ge. Pampoo: An efficient skip-trie based query processing framework for p2p systems. In M. Xu, Y. Zhan, J. Cao, and Y. Liu, editors, Advanced Parallel Processing Technologies, volume 4847, pages 190–198. Springer Berlin Heidelberg. [13] S. Ramabhadran, S. Ratnasamy, J. M. Hellerstein, and S. Shenker. Prefix hash tree: An indexing data structure over distributed hash tables. In Proceedings of the 23rd ACM symposium on principles of distributed computing, volume 37.
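
The asymmetry discussed in the conclusion — hash table range queries scaling with the range size, ordered structures scaling with the result set size — can be sketched in a few lines. This is an illustrative sketch, not code from the paper's implementation: the function names are hypothetical, and a sorted key list with `bisect` stands in for a Red-Black tree's in-order traversal (both give O(log n) positioning followed by a walk over only the k matching keys).

```python
# Sketch: brute-force range query over a hash table vs. an ordered index.
# The hash table must probe every discrete value in [lo, hi]; the ordered
# index touches only the keys that actually fall inside the range.
import bisect

def hash_range_query(table, lo, hi):
    """Brute force: one lookup per discrete value in [lo, hi].
    Cost is O(hi - lo + 1), regardless of how many keys exist."""
    return [(k, table[k]) for k in range(lo, hi + 1) if k in table]

def ordered_range_query(sorted_keys, table, lo, hi):
    """Ordered index (stand-in for a Red-Black tree's in-order walk):
    locate lo in O(log n), then emit only the k keys in range."""
    i = bisect.bisect_left(sorted_keys, lo)
    out = []
    while i < len(sorted_keys) and sorted_keys[i] <= hi:
        out.append((sorted_keys[i], table[sorted_keys[i]]))
        i += 1
    return out

# A sparse workload: 3 keys inside a range of 1,000,000 discrete values.
# The brute-force version performs a million probes for the same answer.
data = {5: "a", 500_000: "b", 999_999: "c"}
keys = sorted(data)
assert hash_range_query(data, 0, 999_999) == ordered_range_query(keys, data, 0, 999_999)
```

This matches the paper's observation that the two structures should converge only on dense ranges, where the number of probes and the result set size approach each other.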
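
The lookup algorithm discussed in the future-work paragraph can also be sketched. PHT [13] finds, for a key of D bits, the unique trie leaf whose label is a prefix of the key by binary searching over prefix lengths, issuing one DHT get per probe (O(log D) gets instead of D). The sketch below is a simplified single-node illustration: the DHT is simulated by a local dict mapping node labels to "leaf"/"internal" markers, and the function name is hypothetical.

```python
# Sketch of PHT's binary-search lookup over prefix lengths [13].
# dht simulates DHT gets: label -> "leaf", "internal", or absent.

def pht_lookup(dht, key_bits):
    """Return the label of the leaf whose label is a prefix of key_bits,
    using O(log D) simulated DHT gets, where D = len(key_bits)."""
    lo, hi = 0, len(key_bits)
    while lo <= hi:
        mid = (lo + hi) // 2
        node = dht.get(key_bits[:mid])
        if node == "leaf":
            return key_bits[:mid]     # found the responsible leaf
        if node == "internal":
            lo = mid + 1              # leaf lies on a longer prefix
        else:
            hi = mid - 1              # no such node: leaf is shallower
    return None

# Tiny 3-bit trie: leaves at "0", "10", and "11".
dht = {"": "internal", "0": "leaf", "1": "internal", "10": "leaf", "11": "leaf"}
assert pht_lookup(dht, "101") == "10"
assert pht_lookup(dht, "011") == "0"
```

The future-work idea in the conclusion would add a second map from data keys directly to leaf labels, trading this O(log D) search for a single DHT get at the cost of extra maintenance on splits and merges.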