Indexing
CS6320, 1/29/2018
Shachi Deshpande, Yunhe Liu

Content

● Motivation for Indexing
● B-tree
  ○ B-tree basics
  ○ The cost of B-tree operations
  ○ B-tree variants
  ○ B-tree in multi-user environments
● Learned Index

Motivation for Indexing

Activity Question: Why do we need indexing?

Motivation for Indexing

Activity Question: Why do we need indexing?

● Items are retrieved from secondary storage into memory before being processed.
● Organizing files intelligently makes the retrieval process efficient.
● A large, randomly accessed file in a computer system is associated with an index,
  ○ which is like the labels on drawers,
  ○ directing the searcher to the small part of the file containing the desired item.

Content

● Motivation for Indexing
● B-tree
  ○ B-tree basics
  ○ The cost of B-tree operations
  ○ B-tree variants
  ○ B-tree in multi-user environments
● Learned Index

Operations on a file

● Files: sequences of records (k0, α0), (k1, α1), (k2, α2), …
● Each record: r_i = (k_i, α_i), where k_i is the key and α_i is the associated information
● Operations
  ○ Insert: add a new record (k_i, α_i), checking that k_i is unique
  ○ Delete: remove the record (k_i, α_i), given k_i
  ○ Find: retrieve α_i, given k_i
  ○ Next: retrieve α_{i+1}, given that α_i was just retrieved

B-tree: Generalization of Binary Search Tree

● More than 2 paths leave a given node.
● Compare the query key with the keys stored at the node to decide which path to take.
● Exact match (success). No exact match and a leaf is reached (failure).

B-tree of Order d

● Each node contains at most 2d keys and 2d + 1 pointers.
● Each node contains at least d keys and d + 1 pointers (at least ½ full).
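A minimal sketch (our own, not the lecture's code) of an order-d node and the find operation just described; names and structure are illustrative:

```python
import bisect

# Sketch of an order-d B-tree node. Invariants (root excepted):
# d <= len(keys) <= 2d, and an internal node has len(keys) + 1 children.
class BTreeNode:
    def __init__(self, keys, values, children=None):
        self.keys = keys            # sorted keys k_i
        self.values = values        # associated information alpha_i
        self.children = children    # None for a leaf

def find(node, key):
    """Return alpha_i for key k_i, or None if a leaf is reached (failure)."""
    while node is not None:
        i = bisect.bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return node.values[i]   # exact match: success
        node = node.children[i] if node.children else None
    return None
```

In practice d is chosen so that one node fills one secondary-storage page, which is what makes the multiway branching pay off.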

Balancing

B-Tree:

● Never visits more than 1 + log_d(n) nodes.
● Accessing each node is a separate access to secondary storage.

Insertion

1. Find: proceed from the root to locate the proper leaf for insertion.
2. Insert: balance is restored by a procedure that moves from the leaf back toward the root.

Insertion: Split

Of the 2d + 1 keys, the smallest d are placed in one node, the largest d are placed in another node, and the remaining middle key is promoted to the parent node as a separator. The split can propagate up to the root, in which case the tree's height increases by 1.
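A sketch of just the key bookkeeping in a split (node allocation and child-pointer updates elided):

```python
def split(keys, d):
    """keys: sorted list of 2d + 1 keys that overflowed a node of order d."""
    assert len(keys) == 2 * d + 1
    left = keys[:d]          # smallest d keys stay in one node
    middle = keys[d]         # promoted to the parent as the separator
    right = keys[d + 1:]     # largest d keys go to a new sibling node
    return left, middle, right

left, sep, right = split([3, 7, 9, 12, 15], d=2)   # -> [3, 7], 9, [12, 15]
```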

Deletion

Find the proper node. There are two possibilities:

1. The key to be deleted resides in a leaf.
2. The key resides in a nonleaf node.
  a. An adjacent key must be found and swapped into the vacated position.
  b. Use the leftmost leaf in the right subtree.

Deletion: Underflow

After the removal, check that at least d keys remain in the node. If a node has fewer than d keys, underflow is said to occur and redistribution of the keys becomes necessary.

Deletion: Concatenation

● Redistribution of keys among two neighbors is possible only if they hold at least 2d keys between them.
● When fewer than 2d keys remain, a concatenation must occur.
  ○ The keys are simply combined into one of the nodes and the other is discarded.
  ○ Since only one node remains, the key separating the two nodes in the ancestor is no longer necessary; it is added to the single remaining node.
  ○ If the only descendants of the root are concatenated, the merged node forms a new root, decreasing the B-tree height by 1.
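A sketch of the choice between redistribution and concatenation (key bookkeeping only; on redistribution, the median of the combined keys becomes the new separator in the parent):

```python
def fix_underflow(node_keys, sibling_keys, separator, d):
    """Repair an underflowed node (len(node_keys) == d - 1) using a neighbor."""
    combined = sorted(node_keys + [separator] + sibling_keys)
    if len(node_keys) + len(sibling_keys) >= 2 * d:
        # Redistribution: share keys evenly; the median becomes the new separator.
        mid = len(combined) // 2
        return combined[:mid], combined[mid], combined[mid + 1:]
    # Concatenation: merge everything into one node (at most 2d keys fit);
    # the separator vanishes from the parent, which may itself underflow.
    return combined, None, None

fix_underflow([5], [10, 12, 14], 8, d=2)   # -> ([5, 8], 10, [12, 14])
fix_underflow([5], [10, 12], 8, d=2)       # -> ([5, 8, 10, 12], None, None)
```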

Content

● Motivation for Indexing
● B-tree
  ○ B-tree basics
  ○ The cost of B-tree operations
  ○ B-tree variants
  ○ B-tree in multi-user environments
● Learned Index

The cost of operations

● Retrieval costs
● Insertion and Deletion costs
● Sequential Processing

Retrieval costs

● The cost of a find operation grows as the logarithm of the file size.
● With d the order of the B-tree, n the number of keys in the file, and h the height of the tree: a find visits at most h ≤ 1 + log_d(n) nodes (the bound stated earlier).
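Worked example (numbers chosen for illustration): with order d = 100 and n = 1,000,000 keys, a find visits at most 1 + log_100(10^6) = 1 + 3 = 4 nodes, i.e., four secondary-storage accesses.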

Insertion and Deletion costs - Tree Height

● May require additional secondary-storage accesses beyond the cost of a find operation as the procedure progresses back up the tree.
● Overall, the costs are at most doubled, so the height of the tree still dominates the cost.
● In a B-tree of order d for a file of n records, insertion and deletion take time proportional to log_d(n) in the worst case.

Insertion and Deletion costs - Tree Order

● As the branch factor d increases, the logarithmic base increases, and the cost of find, insert, and delete operations decreases.
● There are practical limits on the size of a node.
  ○ Most hardware systems bound the amount of data that can be transferred in one access to secondary storage.
  ○ The cost estimate hides a constant factor that grows with the amount of data transferred.

Sequential Processing

● Using the next operation to process all records in key-sequence order.
● A B-tree may not do well in sequential processing:
  ○ A preorder tree walk requires space for at least h = log_d(n+1) nodes in main memory, since it stacks the nodes along a path from the root to avoid reading them twice.
  ○ Processing a next operation may require tracing a path through several nodes before reaching the desired key.
● The B+-tree improves sequential processing performance.

Content

● Motivation for Indexing
● B-tree
  ○ B-tree basics
  ○ The cost of B-tree operations
  ○ B-tree variants
  ○ B-tree in multi-user environments
● Learned Index

B-Tree variants

● Different variations
  ○ Splitting vs. redistributing to a neighbor
  ○ Processing a node once it has been retrieved from secondary storage using different search methods (e.g., linear search, binary search)
  ○ Varying the "order" at each depth
● B*-Trees
● B+-Trees

B*-Trees

● Each node is at least ⅔ full (instead of just ½ full).
● Delay splitting until 2 sibling nodes are full, then divide them into 3 nodes, each ⅔ full.
● Increases storage utilization.
● Speeds up search, as the height of the tree is reduced.

B+-Tree structure

● All keys reside in the leaves.
● Nonleaf levels are organized as a B-tree and consist only of the index.
● Leaf nodes are usually linked together left-to-right.

B+-Tree Operations

● Insertion:
  ○ Almost identical to B-tree insertion.
  ○ During a split, instead of promoting the middle key, promote a copy of it (the key itself stays in a leaf).
● Deletion:
  ○ The key to be deleted always resides in a leaf node, which makes deletion simple.
  ○ As long as the leaf remains at least half full, the upper index levels do not need to change.
● Find:
  ○ Search does not stop on an exact match at an index level; instead the right pointer is followed.
  ○ Always proceeds all the way to a leaf.
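A sketch of why the left-to-right leaf links matter for sequential processing: after a single descent to the starting leaf, each next operation just walks the sibling chain (Leaf is a minimal stand-in type of our own):

```python
class Leaf:
    def __init__(self, keys, values):
        self.keys, self.values = keys, values
        self.next = None            # right sibling, set when leaves are linked

def scan_from(leaf, start_key):
    """Yield (key, value) pairs >= start_key by following sibling links."""
    while leaf is not None:
        for k, v in zip(leaf.keys, leaf.values):
            if k >= start_key:
                yield k, v
        leaf = leaf.next

a, b = Leaf([1, 2], ["x", "y"]), Leaf([3, 4], ["z", "w"])
a.next = b
list(scan_from(a, 2))   # [(2, 'y'), (3, 'z'), (4, 'w')]
```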

Content

● Motivation for Indexing
● B-tree
  ○ B-tree basics
  ○ The cost of B-tree operations
  ○ B-tree variants
  ○ B-tree in multi-user environments
● Learned Index

B-tree in Multiuser Environment

● Should permit several user requests to be processed simultaneously.
● One process may read a node and follow one of its links while another process is changing that node.
● Find operations go top-down, while insertion and deletion require bottom-up access.

B-tree in Multiuser Environment: Locking

● Find operation
  ○ Locks a node once it has been read.
  ○ Releases the lock when the search proceeds to the next level.
  ○ A reader locks at most two nodes at any time.
● Update operation
  ○ Places a reservation on access.
  ○ The reservation is converted to an absolute lock if the update's changes will propagate to the reserved node; otherwise the reservation is cancelled.
  ○ A reserved node may be read but may not be reserved a second time.
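A sketch of the reader's side of this protocol (often called lock coupling); the update-side reservation logic is not shown, and the node layout is the illustrative one from the earlier B-tree sketch:

```python
import bisect
import threading

class Node:
    def __init__(self, keys, values, children=None):
        self.keys, self.values, self.children = keys, values, children
        self.lock = threading.Lock()

def find_lock_coupled(root, key):
    """Reader holds at most two locks: the parent is released only after
    the child has been locked."""
    node = root
    node.lock.acquire()
    while True:
        i = bisect.bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            value = node.values[i]
            node.lock.release()
            return value
        if not node.children:
            node.lock.release()         # fell off a leaf: failure
            return None
        child = node.children[i]
        child.lock.acquire()            # briefly two locks are held
        node.lock.release()             # then the parent's lock is dropped
        node = child
```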

B-tree in Multiuser Environment: Security

● Protection of information in a multiuser environment.
● The memory-protection mechanism of paging.
● Encryption techniques can be used to protect the contents of a file outside of the underlying system.

Summary of B-tree

● Efficient, simple, and easily maintained.
● Logarithmic-cost find, insert, and delete operations.
● Guarantees at least 50% storage utilization.
● The B+-tree allows efficient sequential processing.
● There are many variants of the B-tree.
● Can be used in multiuser environments.

Content

● Motivation for Indexing
● B-tree
  ○ B-tree basics
  ○ The cost of B-tree operations
  ○ B-tree variants
  ○ B-tree in multi-user environments
● Learned Index

Indexes as Models

● B-Tree Index: maps a key to the position of a record in a sorted array
● Hash Index: maps a key to the position of a record in an unsorted array
● BitMap Index: checks if a data record exists

Indexes as Models

● B-Tree Index: maps a key to the position of a record in a sorted array
● Hash Index: maps a key to the position of a record in an unsorted array
● BitMap Index: checks if a data record exists

Can we replace these traditional models with other kinds of models?

Activity

If we have fixed-length records with continuous integer keys from 1 to 1 million, can we find a better way to access the record corresponding to any given key?

What if the length of each record was one unit greater than that of its immediate predecessor?
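A tiny sketch of the point behind both questions (the record size is a hypothetical value): when the key-to-position mapping is known analytically, the offset of any record can be computed directly, with no index at all:

```python
RECORD_SIZE = 128  # bytes; fixed-length case (illustrative value)

def offset_fixed(key):
    """Keys run 1..1,000,000; record k starts right after the k - 1 before it."""
    return (key - 1) * RECORD_SIZE

def offset_growing(key, first_size=1):
    """Each record one unit longer than its predecessor: sizes form an
    arithmetic series, so the starting offset has a closed form."""
    n = key - 1                        # number of records preceding `key`
    return n * first_size + n * (n - 1) // 2

offset_growing(4)   # 1 + 2 + 3 == 6 units precede record 4
```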

Knowing data distribution helps!

● ML models, especially neural nets, can learn a variety of data distributions, mixtures, and other patterns.
● Balancing the complexity of the model with its accuracy is important.

What should the model learn?

A model that predicts the position of a key within a sorted array effectively approximates the Cumulative Distribution Function (CDF) of the keys:

p = F(Key) × N, where N is the total number of records and p is the position estimate.
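As a minimal illustration of the formula (our own stand-in, not the paper's model), the empirical CDF of the keys plays the role of F; a learned index replaces it with a fitted model such as a linear regression or a small neural net:

```python
import bisect

def position_estimate(sorted_keys, key):
    """Position estimate p = F(key) * N, with F the empirical CDF."""
    n = len(sorted_keys)
    f = bisect.bisect_right(sorted_keys, key) / n   # empirical CDF F(key)
    return int(f * n)                               # position p
```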

[Figure: estimated CDF]

Naive Learned Index

● Performance is worse than traditional B-Trees!
● It might be more CPU- and space-efficient to narrow down the position of an item from the entire dataset to a region of thousands of records.
● It is significantly more difficult to run the last mile (the final search within that region).
● The cache-efficiency and memory efficiency of B-Trees are difficult to replicate in our model.

Learning Index Framework

● Learns simple models on the fly and relies on TensorFlow for complex models.
● Generates efficient index structures in C++ for inference.
● Runs simple models in on the order of 30 ns.

Recursive Model Index

● Learns a hierarchy of models instead of a single unified model for indexing.
● Each stage takes the key as input and selects a better-suited model in the next layer of the hierarchy.
● The final stage predicts the position.
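A sketch of a two-stage RMI under simplifying assumptions: every stage is the plain least-squares fit below (the paper allows neural nets, other regressions, or even B-Trees per stage), and the root's prediction routes each key to one of `fanout` second-stage models:

```python
def linear_fit(pairs):
    """Least-squares line through (key, position) pairs; constant if degenerate."""
    n = len(pairs)
    if n == 0:
        return lambda k: 0.0
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    var = sum((x - mx) ** 2 for x, _ in pairs)
    slope = 0.0 if var == 0 else sum((x - mx) * (y - my) for x, y in pairs) / var
    return lambda k: slope * (k - mx) + my

def build_rmi(sorted_keys, fanout=4):
    n = len(sorted_keys)
    root = linear_fit([(k, i) for i, k in enumerate(sorted_keys)])   # stage 1
    route = lambda k: min(fanout - 1, max(0, int(root(k) * fanout / n)))
    buckets = [[] for _ in range(fanout)]
    for i, k in enumerate(sorted_keys):
        buckets[route(k)].append((k, i))         # assign keys by prediction
    leaves = [linear_fit(b) for b in buckets]    # stage 2
    def predict(key):
        return int(round(leaves[route(key)](key)))
    return predict
```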

Hybrid Indexes

● Different layers can use different types of learning models.
● Ideas?

Hybrid Indexes

● Different layers can use different types of learning models.
● Ideas?
  ○ A small ReLU neural net at the top can learn a wide range of complex data distributions.
  ○ The bottom can use simple linear regression models, which are inexpensive in space and execution time.
  ○ Traditional B-Trees (i.e., decision trees) can be used at the bottom if the data is particularly difficult to learn.

Hybrid Indexes

● Different layers can use different types of learning models.
● Ideas?
  ○ A small ReLU neural net at the top can learn a wide range of complex data distributions.
  ○ The bottom can use simple linear regression models, which are inexpensive in space and execution time.
  ○ Traditional B-Trees (i.e., decision trees) can be used at the bottom if the data is particularly difficult to learn.
● The worst-case performance of learned indexes is thus bounded by that of B-Trees!

Searching record with learned index

● Find the first key higher/lower than the look-up key, starting from the model's prediction.
● Model-biased search
  ○ The midpoint of binary search is set to the position predicted by the model.
● Biased quaternary search
  ○ Three middle points: pos − σ, pos, pos + σ.
● The min- and max-errors of the model define the search area.
● Can this work for non-existent keys?
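A sketch of the biased search (function names are ours): the model's prediction seeds the first probe, and the min/max training errors bound the region that must be searched:

```python
def biased_binary_search(a, key, pred, lo, hi):
    """Binary search over a[lo:hi] whose first probe is the predicted position."""
    mid = min(max(pred, lo), hi - 1)     # first midpoint = model prediction
    while lo < hi:
        if a[mid] < key:
            lo = mid + 1
        elif a[mid] > key:
            hi = mid
        else:
            return mid
        mid = (lo + hi) // 2             # fall back to ordinary halving
    return None

def model_biased_find(a, key, predict, min_err, max_err):
    """Search only [pred - min_err, pred + max_err], as bounded in training."""
    pred = predict(key)
    lo = max(0, pred - min_err)
    hi = min(len(a), pred + max_err + 1)
    return biased_binary_search(a, key, pred, lo, hi)
```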

Indexing Strings / Training Results

● Maps - the longitude of locations is relatively linear and has few irregularities.
● Weblogs - a worst-case scenario with complex time patterns.
● Log-Normal (synthetic) - highly non-linear; the CDF is difficult to learn using NNs.

Results

Comparison with other models

● Lookup tables with fixed-size records allow the use of AVX instructions.
● FAST - a highly SIMD-optimized tree structure; its memory requirement is higher.
● Fixed-size B-tree with interpolation search (a variation of binary search for uniformly distributed data) - the height of the B-tree is fixed to reduce memory consumption.
● Multivariate Learned Index - multivariate linear regression is used at the top layer of the hierarchy, with variables like key, log(key), key².

Results

String Datasets

● The speed-up for the learned index is less prominent due to the high cost of model execution and of search over strings.
● Higher precision in hybrid indexes helps, since string search is more expensive.
● Different search strategies make a difference (biased binary search vs. biased quaternary search).
● A non-hybrid RMI with quaternary search performed best.

Point Index

● Hash-maps have been used for point look-ups.
● Efficient implementations aim to reduce conflicts.
● Previous learned models for hash functions did not consider the underlying data distribution, and hence the size of the data structure grew with the data size.

Hash-Model Index

● Learn the CDF of the key distribution.
● We do not aim to store keys compactly or in strictly sorted order.
● Inserts, look-ups, and conflict handling depend on the hash-map architecture.
● The benefits of a learned hash function depend on the accuracy of the model in representing the CDF, the hash-map architecture, etc.
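The idea reduces to scaling the model's CDF estimate by the table size; a minimal sketch of ours, where `cdf_model` is any callable returning F(key) in [0, 1]:

```python
def learned_hash(key, cdf_model, num_slots):
    """Map a key to a slot in proportion to its CDF value, so keys spread
    according to their actual distribution and conflicts drop."""
    return min(num_slots - 1, int(cdf_model(key) * num_slots))
```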

Results

● Learned models reduced conflicts by up to 77% over these datasets, learning the empirical CDF at reasonable cost.
● For distributed settings with RDMA look-ups, the benefits of learned models can be high.
● Depending on the hash-map architecture, the complexity of learned models may or may not pay off.

Existence Index

Bloom Filters

● A space-efficient probabilistic data structure used to test whether an element is a member of a set.
● Guarantees no false negatives, but false positives are possible.
● In spite of being space-efficient, it can still occupy a significant amount of memory.
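A minimal Bloom filter sketch for reference (sizing formulas omitted; SHA-256 with a per-probe prefix stands in for the k independent hash functions):

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.m, self.k = num_bits, num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # All k bits set -> "probably present" (false positives possible);
        # any bit clear -> definitely absent (no false negatives).
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))
```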

Learned Bloom Filters

● Given the high latencies of accessing cold storage, we can afford more complex models that reduce false positives and space requirements.
● What properties should a good hash function for bloom filters have?

Learned Bloom Filters

● Given the high latencies of accessing cold storage, we can afford more complex models that reduce false positives and space requirements.
● What properties should a good hash function for bloom filters have?
  ○ Lots of collisions among keys.
  ○ Lots of collisions among non-keys (keys that don't exist).
  ○ Few collisions between keys and non-keys.

Learned Bloom Filters

● Given the high latencies of accessing cold storage, we can afford more complex models that reduce false positives and space requirements.
● What properties should a good hash function for bloom filters have?
  ○ Lots of collisions among keys.
  ○ Lots of collisions among non-keys (keys that don't exist).
  ○ Few collisions between keys and non-keys.
● Maintain a specific FPR for realistic queries while keeping the FNR at zero.
● Existence indices have traditionally not used the distribution of keys to their advantage, but learned bloom filters can!
● Any ideas?

Learned Bloom Filters

Bloom filters as a classification problem:

● Use a neural network with sigmoid activation to produce a binary probabilistic classifier.
● Choose a threshold τ such that outputs above τ are assumed to exist in the database.
● Such a model will have a positive FNR along with a positive FPR! Solutions?

Learned Bloom Filters

Bloom filters as a classification problem:

● Use a neural network with sigmoid activation to produce a binary probabilistic classifier.
● Choose a threshold τ such that outputs above τ are assumed to exist in the database.
● Such a model will have a positive FNR along with a positive FPR! Solutions?

Learned Bloom Filters

How to maintain a specific FPR p*?

FPR_O = FPR_τ + (1 − FPR_τ) × FPR_B

For simplicity, keep FPR_τ = FPR_B = p*/2 to ensure FPR_O ≤ p*. Such a τ can be tuned over a held-out data-set of non-keys.

The learned model is small compared to the dataset, and the overflow bloom filter scales with the FNR => lower memory footprint.
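A sketch of the resulting two-stage structure (class and method names are ours): the classifier answers first, and an overflow filter built from the model's false negatives restores a zero FNR; the overall FPR composes exactly as in the formula above. Any object supporting `add` and `in` works as the overflow, e.g. a plain `set` for testing or the BloomFilter sketch above in practice:

```python
class LearnedBloomFilter:
    def __init__(self, model, tau, overflow):
        self.model = model          # callable: item -> probability of existing
        self.tau = tau              # threshold chosen on held-out non-keys
        self.overflow = overflow    # small backup filter for model misses

    @classmethod
    def build(cls, model, tau, keys, overflow):
        for k in keys:
            if model(k) <= tau:     # the model would miss this key...
                overflow.add(k)     # ...so the overflow filter covers it
        return cls(model, tau, overflow)

    def __contains__(self, item):
        return self.model(item) > self.tau or item in self.overflow
```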

Bloom filters with model hashes - learn a hash function such that most keys are mapped to higher bit positions and non-keys to lower bit positions => the same probabilistic model can be used!

Results

Example - a normal bloom filter with an FPR of 1% needs 2.04 MB. A 16-dimensional GRU-type RNN requires 0.0259 MB. Setting τ = 0.5% makes the FNR 55%, and the spillover bloom filter requires 1.39 MB (a 36% reduction in size).

Additional work - covariate shifts in the query distribution, using additional features for the ML models, etc.

Conclusions and Future Directions

● Exploring other ML models
● Multi-dimensional indexes using any combination of attributes as the key
● Learned
● GPU/TPU and other hardware improvements