LSM-trie: An LSM-tree based Ultra-Large Key-Value Store for Small Data

Sruthi Chandupatla 1001771780

LevelDB is an LSM-tree based store, whereas LSM-trie is a trie (prefix-tree) based design. The main goals of this design are to significantly reduce the metadata required for locating and reading items in the KV store, and to reduce write amplification by an order of magnitude. Compared to LevelDB, the LSM-trie reduces write amplification to 7 (in a 7-level LSM-trie with amplification factor 10, one write per level) from 77 in the case of a 7-level LSM-tree with the same amplification factor (roughly 11 writes per level). The algorithm applies SHA-1 hashing to each key to obtain a 160-bit hashkey: the first 32 bits, called the prefix, are used to maintain the tree structure; a following 32-bit infix is used to choose the bucket inside an HTable where the KV item is stored; and the last 64 bits, the suffix, are used to balance the buckets by deciding which KV items move from an overflowed bucket to an underfull one. We maintain migration metadata for determining whether an item has been migrated or not. Apart from this metadata, we also have Bloom filters to tell whether a key may be present in a particular block, with some false-positive rate. All of this lets the LSM-trie provide good performance in terms of both write amplification and read amplification. Because of the prefix-tree structure, the Bloom filters of all the sublevels of a level can be checked efficiently; this arrangement is called a clustered Bloom filter. It allows at most one disk access at each level, since the prefix is the same for the corresponding buckets in all sublevels of a level. Moreover, the LSM-trie has no index to be stored in memory.
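To make the hashkey layout concrete, here is a minimal Python sketch of the split described above. The exact bit positions (32-bit prefix, 32-bit infix, 64-bit suffix) and the bucket count follow this write-up and are illustrative, not the paper's precise on-disk layout:

```python
import hashlib

def split_hashkey(user_key: bytes):
    """Split a 160-bit SHA-1 hashkey into prefix/infix/suffix (illustrative widths)."""
    digest = hashlib.sha1(user_key).digest()   # 20 bytes = 160 bits
    h = int.from_bytes(digest, "big")
    prefix = h >> 128                          # first 32 bits: trie path
    infix = (h >> 96) & 0xFFFFFFFF             # next 32 bits: bucket within an HTable
    suffix = h & 0xFFFFFFFFFFFFFFFF            # last 64 bits: ordering for migration
    return prefix, infix, suffix

prefix, infix, suffix = split_hashkey(b"some-user-key")
bucket = infix % 8192   # e.g., a 32 MB HTable divided into 8192 4 KB buckets
```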

The prototyped LSM-trie has HTables of 32 MB and an amplification factor of 8. The store has 5 levels, with the first four levels having 8 sublevels each and the 5th level having 80 sublevels, for a total of 112 sublevels. The buckets are of 4 KB size; they are designed so that each bucket can be read with a single disk read. Bloom filters of 16 bits per key are used, which achieves roughly a 5% false-positive rate per lookup. At level i, once all the sublevels are filled, they are merge-sorted across and pushed down as a new sublevel of level i+1. This makes a major compaction cost only 1 write per item, instead of the exponential (amplification-factor) cost in the case of LevelDB. For reading a key, we just compute the hash of the key, traverse the trie using the prefix, check the Bloom filters with a single read using the clustered Bloom filter, and then finally read the 4 KB block directly containing the KV item.
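As a rough check of the scale this configuration reaches, here is a back-of-the-envelope calculation, assuming each sublevel at level i holds 8^i HTables (the natural fan-out with an amplification factor of 8):

```python
AF, HTABLE_MB = 8, 32
sublevels = [8, 8, 8, 8, 80]   # sublevel counts per level in the prototype
capacity_mb = sum(n * (AF ** i) * HTABLE_MB for i, n in enumerate(sublevels))
print(f"{capacity_mb / 1024 ** 2:.1f} TB")  # ~10.1 TB, dominated by level 4
```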

(1) “Note that LSM-trie uses hash functions to organize its data and accordingly does not support range search.” Do FAWN and LevelDB support range search?

FAWN is hash based: it keeps a hash index that maps 160-bit keys to values stored in a data log. Since this is a data log, the KV items are not sorted, and hence range search is not possible.

LevelDB uses the concept of SSTables, which store sorted key-value pairs. Hence it supports range search.

(2) Use Figure 1 to explain the difference between linear and exponential growth patterns.

In Figure 1, the exponential growth pattern is illustrated across the levels, whereas the linear growth pattern is illustrated by the sublevels within a level.

When a sublevel is filled, we simply add a new sublevel, so the growth cost is linear. But when a whole level is filled and needs to be compacted, its data must be pushed down to the next level, whose capacity is larger by the amplification factor, so the growth across levels is exponential.
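A tiny sketch contrasting the two growth patterns under the prototype's amplification factor of 8:

```python
AF = 8  # amplification factor: each level is 8x larger than the one above

# Exponential growth across levels: the size of one sublevel per level.
for level in range(4):
    print(f"level {level}: one sublevel holds {AF ** level} HTable(s)")

# Linear growth within a level: filling it just appends sublevels 1, 2, ..., 8,
# writing only the newly arriving data -- no existing data is rewritten.
```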

(3) “Because 4KB block is a disk access unit, it is not necessary to maintain a larger index to determine byte offset of each item in a block.” Show how a lookup with a given key is carried out in LevelDB?

There are 2 memtables in the LevelDB structure. We use two so that when one memtable is filled and is being flushed to disk, the other can serve new writes in parallel. So we need to check both memtables in memory before moving on to the on-disk tree. (A code sketch of the full read path is given after step 6.)

Step 1: Since the memtable keeps keys sorted, we can simply use binary search to check whether the key is in the memtable.

Step 2: We then check the second (immutable) memtable, again using binary search.

Step 3: We then check level 0. Level 0 contains several sublevels (SSTables) that are each sorted internally but not sorted across one another, so we need to check every sublevel whose SSTable key range includes the key. An index is stored for this purpose, and within each such sublevel we use binary search since the keys are sorted. To check whether a key may be present in an SSTable, we can also consult the Bloom filter of each 4 KB block inside it. Using Bloom filters with a good false-positive rate reduces the read amplification significantly.

Step 4: If we find the key, we retrieve and return its value. Otherwise go to the next step.

Step 5: Go to the next level. Since the SSTables in this level (any level other than level 0) are sorted and non-overlapping across the level, we check only the one SSTable whose key range includes the key. Again, we can consult the Bloom filter of each 4 KB block inside that SSTable before reading from disk.

Step 6: If we find the key, we retrieve and return its value; otherwise repeat step 5 at the next level until the levels are exhausted.
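Putting steps 1-6 together, here is a minimal pseudo-Python sketch of this read path. The structures and method names (db.memtable, sst.bloom.may_contain, find_table_covering) are hypothetical illustrations, not LevelDB's actual API:

```python
def leveldb_get(db, key):
    """Sketch of the LevelDB read path in steps 1-6 above (names are hypothetical)."""
    for table in (db.memtable, db.immutable_memtable):    # steps 1-2: in-memory search
        value = table.get(key)
        if value is not None:
            return value
    for sst in db.level0_sstables:                        # step 3: ranges may overlap
        if sst.min_key <= key <= sst.max_key and sst.bloom.may_contain(key):
            value = sst.get(key)                          # binary search inside the table
            if value is not None:
                return value                              # step 4
    for level in db.levels[1:]:                           # steps 5-6: one table per level
        sst = level.find_table_covering(key)              # ranges are disjoint and sorted
        if sst and sst.bloom.may_contain(key):
            value = sst.get(key)
            if value is not None:
                return value
    return None                                           # key is not in the store
```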

(4) “Instead, we first apply a cryptographic hash function, such as SHA-1, on the key, and then use the hashed key, or hashkey in short, to make the determination.” Assuming a user-provided key has 160 bits, what’s the issue if LSM-trie used the user keys, instead of hashed keys, in its operations? (or why does LSM-trie use hashkey, instead of the original user-provided key?)

Even though a user key is 160 bits, if we construct the trie directly from user keys, the tree could be completely unbalanced, because real keys are often skewed. In the worst case, we would get something like a linked list of items instead of a tree structure. As described in the paper, when we use a cryptographic hash function, each branch of the tree has an equal chance of receiving a new item, and the item count in each branch of the tree follows (approximately) a normal distribution, so we get a balanced tree. A hash function like SHA-1 produces uniformly distributed output, which results in a well-balanced tree. So here, the uniformity provided by the hash function matters more than the fact that its output happens to be 160 bits.
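A small experiment illustrating the point: even perfectly sequential (hence highly skewed) user keys spread evenly over the trie's eight top-level branches once hashed with SHA-1. The "user:{i}" key format is arbitrary:

```python
import hashlib
from collections import Counter

branches = Counter()
for i in range(80_000):
    digest = hashlib.sha1(f"user:{i}".encode()).digest()
    branches[digest[0] >> 5] += 1   # top 3 bits pick one of 2^3 = 8 branches

print(sorted(branches.values()))    # eight counts, all close to 10,000
```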

(5) Use Figures 2 and 3 to describe the LSM-trie’s structure and how compaction is performed in the trie.

LSM-trie Structure:

LSM-trie is a prefix-tree based algorithm built over the hash value of the key. The hash output has three important parts: prefix, infix, and suffix. The prefix is the first 32 bits of the hash value and is used to generate the tree structure. Assuming an amplification factor of 8, we need 3 bits at each level, since 3 bits give 8 branches (2^3 = 8). So the first 3 bits are used to move from level 0 to level 1, the next 3 bits to move from level 1 to level 2, and so on. Thus no index is required to traverse the tree; we simply follow the prefix bits to find the HTable that contains the key, as sketched below.
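A minimal sketch of this 3-bit branch selection, assuming the 32-bit prefix and amplification factor of 8 described above:

```python
def branch_bits(prefix: int, level: int) -> int:
    """Return the 3-bit slice of a 32-bit prefix used to branch at a given level."""
    shift = 32 - 3 * (level + 1)   # level 0 uses the top 3 bits, level 1 the next 3, ...
    return (prefix >> shift) & 0b111

# Traversal follows branch_bits(prefix, 0), branch_bits(prefix, 1), ... downward.
# Compaction from level i to i+1 uses branch_bits(prefix, i) to split the merged
# items of level i into 8 groups, one per child at level i+1.
```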

Compaction:

All the sublevels in level i are merged and sorted based on their hashkeys; deleted keys are dropped and updated items are merged at this step. Since the amplification factor here is 8, the next 3 bits of the prefix decide the branch of the tree: the merged items are split into 8 groups by those bits, and each group is pushed as a new sublevel at the top of the corresponding child at level i+1.

(6) “Therefore, the Bloom filter must be beefed up by using more bits.” Why do the Bloom filters have to be longer?

Bloom filters are used to reduce disk reads when retrieving a key from disk. Before reading a disk block, we first check the Bloom filter to see whether the KV item may be present in that block. A Bloom filter is an m-bit vector driven by k hash functions. A Bloom filter has a false-positive rate but no false-negative rate: if the key is actually present in a block, the Bloom filter never says it is absent (no false negatives), but if the key is actually not present in the block, the filter may still say it is present (a false positive).

Let us assume the Bloom filter is an m-bit vector, k is the number of hash functions, and n is the number of keys we want to store. Then the false-positive probability is approximately

$$p \approx \left(1 - e^{-kn/m}\right)^{k},$$

which is minimized by choosing $k = (m/n)\ln 2$, giving $p \approx 0.6185^{\,m/n}$.

So the false-positive rate decreases exponentially with a linear increase in m (the size of the Bloom filter), which is why LSM-trie can afford longer Bloom filters for better filtering at a low false-positive probability: compared to the disk size, the additional on-disk space for the larger filters is minimal. Compared with a 7-level LevelDB, a 7-level LSM-trie needs much longer Bloom filters because each level of the LSM-trie has 8 sublevels, whereas in LevelDB only level 0 has multiple overlapping tables, and a lookup must consult the filter of every sublevel. Because of this large increase in the number of sublevels, LSM-trie needs longer Bloom filters.
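A quick calculation using the approximation above and the prototype's 112 sublevels shows why 10 bits per key is not enough (the 0.6185^(m/n) figure assumes the optimal k):

```python
def per_filter_fp(bits_per_key: float) -> float:
    # False-positive rate of one Bloom filter with an optimal number of hashes.
    return 0.6185 ** bits_per_key

SUBLEVELS = 112   # total sublevels in the prototype; one filter checked per sublevel
for bits in (10, 16):
    p = per_filter_fp(bits)
    print(f"{bits} bits/key: {p:.4%} per filter, "
          f"~{SUBLEVELS * p:.2f} expected false positives per lookup")
# 10 bits/key -> ~0.92 wasted reads per lookup; 16 bits/key -> ~0.05 (about 5%).
```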

As described in the paper, for an example LSM-trie with 112 sublevels across 5 levels, a 10-bit-per-key Bloom filter yields close to one false positive per lookup on average across all the sublevel filters, which defeats the purpose of the Bloom filters entirely. With a 16-bit-per-key Bloom filter, the aggregate false-positive rate drops to only about 5%.

(7) “However, a challenging issue is whether the buckets can be load balanced in terms of aggregate size of KV items hashed into them” Why may the buckets in an HTable be load unbalanced?

As we have already learned, keys are assigned to buckets in an HTable using a hash function. A cryptographic hash function gives every bucket an almost equal chance of receiving each inserted item, but the resulting item count per bucket follows (approximately) a normal distribution around the mean. So the number, and therefore the aggregate size, of items in a bucket varies noticeably from bucket to bucket, and the buckets in the HTable can be load unbalanced.
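A small simulation of this effect, assuming 8192 buckets (a hypothetical count matching a 32 MB HTable of 4 KB buckets) and an average of 100 items per bucket:

```python
import hashlib
from collections import Counter

loads = Counter()
for i in range(8192 * 100):   # 819,200 items, an average of 100 per bucket
    digest = hashlib.sha1(f"key:{i}".encode()).digest()
    loads[int.from_bytes(digest[:4], "big") % 8192] += 1

counts = sorted(loads.values())
# Counts are roughly normal around the mean of 100, so the fullest buckets
# hold noticeably more than the emptiest ones -- hence the load imbalance.
print(counts[0], counts[len(counts) // 2], counts[-1])   # e.g., ~63, 100, ~140
```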