Lecture 26: Esoteric Data Structures: Skip Lists and Bloom Filters Monday, August 14, 2017
Total Page:16
File Type:pdf, Size:1020Kb
Load more
										Recommended publications
									
								- 
												  Skip Lists: a Probabilistic Alternative to Balanced TreesSkip Lists: A Probabilistic Alternative to Balanced Trees Skip lists are a data structure that can be used in place of balanced trees. Skip lists use probabilistic balancing rather than strictly enforced balancing and as a result the algorithms for insertion and deletion in skip lists are much simpler and significantly faster than equivalent algorithms for balanced trees. William Pugh Binary trees can be used for representing abstract data types Also giving every fourth node a pointer four ahead (Figure such as dictionaries and ordered lists. They work well when 1c) requires that no more than n/4 + 2 nodes be examined. the elements are inserted in a random order. Some sequences If every (2i)th node has a pointer 2i nodes ahead (Figure 1d), of operations, such as inserting the elements in order, produce the number of nodes that must be examined can be reduced to degenerate data structures that give very poor performance. If log2 n while only doubling the number of pointers. This it were possible to randomly permute the list of items to be in- data structure could be used for fast searching, but insertion serted, trees would work well with high probability for any in- and deletion would be impractical. put sequence. In most cases queries must be answered on-line, A node that has k forward pointers is called a level k node. so randomly permuting the input is impractical. Balanced tree If every (2i)th node has a pointer 2i nodes ahead, then levels algorithms re-arrange the tree as operations are performed to of nodes are distributed in a simple pattern: 50% are level 1, maintain certain balance conditions and assure good perfor- 25% are level 2, 12.5% are level 3 and so on.
- 
												  Implementing Signatures for Transactional MemoryAppears in the Proceedings of the 40th Annual IEEE/ACM Symposium on Microarchitecture (MICRO-40), 2007 Implementing Signatures for Transactional Memory Daniel Sanchez, Luke Yen, Mark D. Hill, Karthikeyan Sankaralingam Department of Computer Sciences, University of Wisconsin-Madison http://www.cs.wisc.edu/multifacet/logtm {daniel, lyen, markhill, karu}@cs.wisc.edu Abstract An important aspect of conflict detection is recording the addresses that a transaction reads (read set) and writes Transactional Memory (TM) systems must track the (write set) at some granularity (e.g., memory block or read and write sets—items read and written during a word). One promising approach is to use signatures, data transaction—to detect conflicts among concurrent trans- structures that can represent an unbounded number of el- actions. Several TMs use signatures, which summarize ements approximately in a bounded amount of state. Led unbounded read/write sets in bounded hardware at a per- by Bulk [7], several systems including LogTM-SE [31], formance cost of false positives (conflicts detected when BulkSC [8], and SigTM [17], have implemented read/write none exists). sets with per-thread hardware signatures built with Bloom This paper examines different organizations to achieve filters [2]. These systems track the addresses read/written hardware-efficient and accurate TM signatures. First, we in a transaction by inserting them into the read/write sig- find that implementing each signature with a single k-hash- natures, and clear both signatures as the transaction com- function Bloom filter (True Bloom signature) is inefficient, mits or aborts. Depending on the system, signatures must as it requires multi-ported SRAMs.
- 
												  Skip B-TreesSkip B-Trees Ittai Abraham1, James Aspnes2?, and Jian Yuan3 1 The Institute of Computer Science, The Hebrew University of Jerusalem, [email protected] 2 Department of Computer Science, Yale University, [email protected] 3 Google, [email protected] Abstract. We describe a new data structure, the Skip B-Tree that combines the advantages of skip graphs with features of traditional B-trees. A skip B-Tree provides efficient search, insertion and deletion operations. The data structure is highly fault tolerant even to adversarial failures, and allows for particularly simple repair mechanisms. Related resource keys are kept in blocks near each other enabling efficient range queries. Using this data structure, we describe a new distributed peer-to-peer network, the Distributed Skip B-Tree. Given m data items stored in a system with n nodes, the network allows to perform a range search operation for r consecutive keys that costs only O(logb m + r/b) where b = Θ(m/n). In addition, our distributed Skip B-tree search network has provable polylogarithmic costs for all its other basic operations like insert, delete, and node join. To the best of our knowledge, all previous distributed search networks either provide a range search operation whose cost is worse than ours or may require a linear cost for some basic operation like insert, delete, and node join. ? Supported in part by NSF grants CCR-0098078, CNS-0305258, and CNS-0435201. 1 Introduction Peer-to-peer systems provide a decentralized way to share resources among machines. An ideal peer-to-peer network should have such properties as decentralization, scalability, fault-tolerance, self-stabilization, load- balancing, dynamic addition and deletion of nodes, efficient query searching and exploiting spatial as well as temporal locality in searches.
- 
												  On the Cost of Persistence and Authentication in Skip Lists*On the Cost of Persistence and Authentication in Skip Lists⋆ Micheal T. Goodrich1, Charalampos Papamanthou2, and Roberto Tamassia2 1 Department of Computer Science, University of California, Irvine 2 Department of Computer Science, Brown University Abstract. We present an extensive experimental study of authenticated data structures for dictionaries and maps implemented with skip lists. We consider realizations of these data structures that allow us to study the performance overhead of authentication and persistence. We explore various design decisions and analyze the impact of garbage collection and virtual memory paging, as well. Our empirical study confirms the effi- ciency of authenticated skip lists and offers guidelines for incorporating them in various applications. 1 Introduction A proven paradigm from distributed computing is that of using a large collec- tion of widely distributed computers to respond to queries from geographically dispersed users. This approach forms the foundation, for example, of the DNS system. Of course, we can abstract the main query and update tasks of such systems as simple data structures, such as distributed versions of dictionar- ies and maps, and easily characterize their asymptotic performance (with most operations running in logarithmic time). There are a number of interesting im- plementation issues concerning practical systems that use such distributed data structures, however, including the additional features that such structures should provide. For instance, a feature that can be useful in a number of real-world ap- plications is that distributed query responders provide authenticated responses, that is, answers that are provably trustworthy. An authenticated response in- cludes both an answer (for example, a yes/no answer to the query “is item x a member of the set S?”) and a proof of this answer, equivalent to a digital signature from the data source.
- 
												  Traits: Experience with a Language Feature7UDLWV([SHULHQFHZLWKD/DQJXDJH)HDWXUH (PHUVRQ50XUSK\+LOO $QGUHZ3%ODFN 7KH(YHUJUHHQ6WDWH&ROOHJH 2*,6FKRRORI6FLHQFH1(QJLQHHULQJ$ (YHUJUHHQ3DUNZD\1: 2UHJRQ+HDOWKDQG6FLHQFH8QLYHUVLW\ 2O\PSLD$:$ 1::DONHU5G PXUHPH#HYHUJUHHQHGX %HDYHUWRQ$25 EODFN#FVHRJLHGX ABSTRACT the desired semantics of that method changes, or if a bug is This paper reports our experiences using traits, collections of found, the programmer must track down and fix every copy. By pure methods designed to promote reuse and understandability reusing a method, behavior can be defined and maintained in in object-oriented programs. Traits had previously been used to one place. refactor the Smalltalk collection hierarchy, but only by the crea- tors of traits themselves. This experience report represents the In object-oriented programming, inheritance is the normal way first independent test of these language features. Murphy-Hill of reusing methods—classes inherit methods from other classes. implemented a substantial multi-class data structure called ropes Single inheritance is the most basic and most widespread type of that makes significant use of traits. We found that traits im- inheritance. It allows methods to be shared among classes in an proved understandability and reduced the number of methods elegant and efficient way, but does not always allow for maxi- that needed to be written by 46%. mum reuse. Consider a small example. In Squeak [7], a dialect of Smalltalk, Categories and Subject Descriptors the class &ROOHFWLRQ is the superclass of all the classes that $UUD\ +HDS D.2.3 [Programming Languages]: Coding Tools and Tech- implement collection data structures, including , , 6HW niques - object-oriented programming and . The property of being empty is common to many ob- jects—it simply requires that the object have a size method, and D.3.3 [Programming Languages]: Language Constructs and that the method returns zero.
- 
												  Fast Candidate Generation for Real-Time Tweet Search with Bloom Filter ChainsFast Candidate Generation for Real-Time Tweet Search with Bloom Filter Chains NIMA ASADI and JIMMY LIN, University of Maryland at College Park The rise of social media and other forms of user-generated content have created the demand for real-time search: against a high-velocity stream of incoming documents, users desire a list of relevant results at the time the query is issued. In the context of real-time search on tweets, this work explores candidate generation in a two-stage retrieval architecture where an initial list of results is processed by a second-stage rescorer 13 to produce the final output. We introduce Bloom filter chains, a novel extension of Bloom filters that can dynamically expand to efficiently represent an arbitrarily long and growing list of monotonically-increasing integers with a constant false positive rate. Using a collection of Bloom filter chains, a novel approximate candidate generation algorithm called BWAND is able to perform both conjunctive and disjunctive retrieval. Experiments show that our algorithm is many times faster than competitive baselines and that this increased performance does not require sacrificing end-to-end effectiveness. Our results empirically characterize the trade-off space defined by output quality, query evaluation speed, and memory footprint for this particular search architecture. Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval General Terms: Algorithms, Experimentation Additional Key Words and Phrases: Scalability, efficiency, top-k retrieval, tweet search, bloom filters ACM Reference Format: Asadi, N. and Lin, J. 2013. Fast candidate generation for real-time tweet search with bloom filter chains.
- 
												  Exploring the Duality Between Skip Lists and Binary Search TreesExploring the Duality Between Skip Lists and Binary Search Trees Brian C. Dean Zachary H. Jones School of Computing School of Computing Clemson University Clemson University Clemson, SC Clemson, SC [email protected] [email protected] ABSTRACT statically optimal [5] skip lists have already been indepen- Although skip lists were introduced as an alternative to bal- dently derived in the literature). anced binary search trees (BSTs), we show that the skip That the skip list can be interpreted as a type of randomly- list can be interpreted as a type of randomly-balanced BST balanced tree is not particularly surprising, and this has whose simplicity and elegance is arguably on par with that certainly not escaped the attention of other authors [2, 10, of today’s most popular BST balancing mechanisms. In this 9, 8]. However, essentially every tree interpretation of the paper, we provide a clear, concise description and analysis skip list in the literature seems to focus entirely on casting of the “BST” interpretation of the skip list, and compare the skip list as a randomly-balanced multiway branching it to similar randomized BST balancing mechanisms. In tree (e.g., a randomized B-tree [4]). Messeguer [8] calls this addition, we show that any rotation-based BST balancing structure the skip tree. Since there are several well-known mechanism can be implemented in a simple fashion using a ways to represent a multiway branching search tree as a BST skip list. (e.g., replace each multiway branching node with a minia- ture balanced BST, or replace (first child, next sibling) with (left child, right child) pointers), it is clear that the skip 1.
- 
												 Space](https://docslib.b-cdn.net/cover/8458/investigation-of-a-dynamic-kd-search-skip-list-requiring-theta-kn-space-338458.webp)  Investigation of a Dynamic Kd Search Skip List Requiring [Theta](Kn) SpaceHcqulslmns ana nqulstwns et Bibliographie Services services bibliographiques 395 Wellington Street 395, rue Wellington Ottawa ON KIA ON4 Ottawa ON KIA ON4 Canada Canada Your file Votre réference Our file Notre reldrence The author has granted a non- L'auteur a accordé une licence non exclusive licence allowing the exclusive permettant a la National Libraq of Canada to Bibliothèque nationale du Canada de reproduce, loan, distribute or sell reproduire, prêter, distribuer ou copies of this thesis in microforni, vendre des copies de cette thèse sous paper or electronic formats. la forme de microfiche/film, de reproduction sur papier ou sur format électronique. The author retains ownership of the L'auteur conserve la propriété du copyright in this thesis. Neither the droit d'auteur qui protège cette thèse. thesis nor substantial extracts fiom it Ni la thèse ni des extraits substantiels may be printed or othexwise de celle-ci ne doivent être imprimés reproduced without the author's ou autrement reproduits sans son permission. autorisation. A new dynamic k-d data structure (supporting insertion and deletion) labeled the k-d Search Skip List is developed and analyzed. The time for k-d semi-infinite range search on n k-d data points without space restriction is shown to be SZ(k1og n) and O(kn + kt), and there is a space-time trade-off represented as (log T' + log log n) = Sà(n(1og n)"e), where 0 = 1 for k =2, 0 = 2 for k 23, and T' is the scaled query time. The dynamic update tirne for the k-d Search Skip List is O(k1og a).
- 
												  An Optimal Bloom Filter Replacement∗An Optimal Bloom Filter Replacement∗ Anna Pagh† Rasmus Pagh† S. Srinivasa Rao‡ Abstract store the approximation is roughly n log(1/)1, where This paper considers space-efficient data structures for n = |S| and the logarithm has base 2 [5]. In contrast, 0 0 the amount of space for storing S ⊆ {0, 1}w explicitly storing an approximation S to a set S such that S ⊆ S w 0 2 and any element not in S belongs to S with probability is at least log n ≥ n(w − log n) bits, which may be at most . The Bloom filter data structure, solving this much larger if w is large. Here, we consider subsets problem, has found widespread use. Our main result is of the set of w-bit machine words on a standard RAM a new RAM data structure that improves Bloom filters model, since this is the usual framework in which RAM in several ways: dictionaries are studied. Let the (random) set S0 ⊇ S consist of the elements • The time for looking up an element in S0 is O(1), that are stored (including false positives). We want S0 independent of . to be chosen such that any element not in S belongs to S0 with probability at most . For ease of exposition, we • The space usage is within a lower order term of the assume that ≤ 1/2 is an integer power of 2. A Bloom lower bound. filter [1] is an elegant data structure for selecting and representing a suitable set S0. It works by storing, as a • The data structure uses explicit hash function bit vector, the set families.
- 
												  Comparison of Dictionary Data StructuresA Comparison of Dictionary Implementations Mark P Neyer April 10, 2009 1 Introduction A common problem in computer science is the representation of a mapping between two sets. A mapping f : A ! B is a function taking as input a member a 2 A, and returning b, an element of B. A mapping is also sometimes referred to as a dictionary, because dictionaries map words to their definitions. Knuth [?] explores the map / dictionary problem in Volume 3, Chapter 6 of his book The Art of Computer Programming. He calls it the problem of 'searching,' and presents several solutions. This paper explores implementations of several different solutions to the map / dictionary problem: hash tables, Red-Black Trees, AVL Trees, and Skip Lists. This paper is inspired by the author's experience in industry, where a dictionary structure was often needed, but the natural C# hash table-implemented dictionary was taking up too much space in memory. The goal of this paper is to determine what data structure gives the best performance, in terms of both memory and processing time. AVL and Red-Black Trees were chosen because Pfaff [?] has shown that they are the ideal balanced trees to use. Pfaff did not compare hash tables, however. Also considered for this project were Splay Trees [?]. 2 Background 2.1 The Dictionary Problem A dictionary is a mapping between two sets of items, K, and V . It must support the following operations: 1. Insert an item v for a given key k. If key k already exists in the dictionary, its item is updated to be v.
- 
												  Comparison on Search Failure Between Hash Tables and a Functional Bloom Filterapplied sciences Article Comparison on Search Failure between Hash Tables and a Functional Bloom Filter Hayoung Byun and Hyesook Lim * Department of Electronic and Electrical Engineering, Ewha Womans University, Seoul 03760, Korea; [email protected] * Correspondence: [email protected]; Tel.: +82-2-3277-3403 Received:17 June 2020; Accepted: 27 July 2020; Published: 29 July 2020 Abstract: Hash-based data structures have been widely used in many applications. An intrinsic problem of hashing is collision, in which two or more elements are hashed to the same value. If a hash table is heavily loaded, more collisions would occur. Elements that could not be stored in a hash table because of the collision cause search failures. Many variant structures have been studied to reduce the number of collisions, but none of the structures completely solves the collision problem. In this paper, we claim that a functional Bloom filter (FBF) provides a lower search failure rate than hash tables, when a hash table is heavily loaded. In other words, a hash table can be replaced with an FBF because the FBF is more effective than hash tables in the search failure rate in storing a large amount of data to a limited size of memory. While hash tables require to store each input key in addition to its return value, a functional Bloom filter stores return values without input keys, because different index combinations according to each input key can be used to identify the input key. In search failure rates, we theoretically compare the FBF with hash-based data structures, such as multi-hash table, cuckoo hash table, and d-left hash table.
- 
												  Invertible Bloom Lookup TablesInvertible Bloom Lookup Tables Michael T. Goodrich Dept. of Computer Science University of California, Irvine http://www.ics.uci.edu/˜goodrich/ Michael Mitzenmacher Dept. of Computer Science Harvard University http://www.eecs.harvard.edu/˜michaelm/ Abstract We present a version of the Bloom filter data structure that supports not only the insertion, deletion, and lookup of key-value pairs, but also allows a complete listing of the pairs it contains with high probability, as long the number of key- value pairs is below a designed threshold. Our structure allows the number of key- value pairs to greatly exceed this threshold during normal operation. Exceeding the threshold simply temporarily prevents content listing and reduces the probability of a successful lookup. If entries are later deleted to return the structure below the threshold, everything again functions appropriately. We also show that simple variations of our structure are robust to certain standard errors, such as the deletion of a key without a corresponding insertion or the insertion of two distinct values for a key. The properties of our structure make it suitable for several applications, including database and networking applications that we highlight. 1 Introduction The Bloom filter data structure [5] is a well-known way of probabilistically supporting arXiv:1101.2245v3 [cs.DS] 4 Oct 2015 dynamic set membership queries that has been used in a multitude of applications (e.g., see [8]). The key feature of a standard Bloom filter is the way it trades off query accuracy for space efficiency, by using a binary array T (initially all zeroes) and k random hash functions, h1; : : : ; hk, to represent a set S by assigning T [hi(x)] = 1 for each x 2 S.