DATA STRUCTURES OF BIG DATA: HOW THEY SCALE

Dibyendu Bhattacharya, Principal Technologist, EMC ([email protected])
Manidipa Mitra, Principal Software Engineer, EMC ([email protected])

Table of Contents

Introduction
Hadoop: Optimization for Local Storage Performance
Kafka: Scaling Distributed Logs using OS Page Cache
MongoDB: Memory Mapped File to Store B-Tree Index and Data Files
HBase: Log Structured Merge Tree – Write Optimized B-Tree
Storm: Efficient Lineage Tracking for Guaranteed Message Processing
Conclusion

Disclaimer: The views, processes, or methodologies published in this article are those of the authors. They do not necessarily reflect EMC Corporation’s views, processes, or methodologies.


Introduction

We are living in the data age. Big data—information of extreme volume, diversity, and complexity—is everywhere. Enterprises, organizations, and institutions are beginning to recognize that this huge volume of data can potentially deliver high value to their business. The explosion of data has led to significant innovation in various technologies, all revolving around how this huge volume of data can be captured, stored, and processed to extract meaningful insights that will help them make better decisions much faster, perform predictions of various outcomes, and so on.

The Big Data technology revolution can be broadly categorized into the following areas:

 Technologies around Batch Processing of Big Data (Hadoop, Hive, Pig, Shark, etc.)
 Technologies around Big Data Messaging infrastructure (Kafka)
 Big Data Databases: NoSQL Technologies (HBase, Cassandra, MongoDB, etc.)
 Technologies around Real Time processing of Big Fast Data (Storm, Spark, etc.)
 Big Data Search Technologies (ElasticSearch, SolrCloud, etc.)
 Massively Parallel Processing (MPP) Technologies (HAWQ, Impala, Drill, etc.)

For each of these broad categories of Big Data technologies, various products or solutions have either already matured or started evolving. In this article, we discuss a few of the most popular and prominent open source technologies and try to explain how efficiently each solution has applied various data structure concepts and used computer science principles to solve very complex problems. This article highlights one of the key Big Data challenges each of these solutions was required to address and how each tackles those challenges using fundamentals of data structures and operating system concepts. It will help the reader understand the major scalability concerns that big data brings to the table and how to build a system designed to scale.

Let us first look at Hadoop, the most popular Big Data batch processing system, and see how it is optimized for local storage performance.

Hadoop: Optimization for Local Storage Performance

Hadoop has become synonymous with big data processing technologies. The most sought-after technology for distributed parallel computing, Hadoop solves a specific problem in the big data spectrum. It is a highly scalable, distributed, fault-tolerant big data processing system designed to run on commodity hardware. Hadoop consists of two major components:


1. Hadoop Distributed File System (HDFS), a fault-tolerant, highly consistent distributed file system
2. The MapReduce engine, which is basically the parallel programming paradigm on top of the distributed file system

The Hadoop framework is highly optimized for batch processing of huge amounts of data, where the large files are stored across multiple machines. HDFS is implemented as a user-level file system in Java which exploits the native Filesystem on each node, such as ext3 or NTFS, to store data. Files in HDFS are divided into large blocks, typically 64MB, and each block is stored as a separate file in the local file system.

HDFS is implemented by two services, NameNode and DataNode. NameNode is the master daemon which manages the file system metadata, and DataNode, the slave daemon, actually stores the data blocks. The MapReduce engine is implemented by two services, JobTracker and TaskTracker: JobTracker is the MapReduce master daemon which schedules and monitors distributed jobs, and TaskTracker is the slave daemon which actually performs the individual tasks.

Hadoop MapReduce applications use storage in a manner that differs from general-purpose computing. First, the data files accessed are large, typically tens to hundreds of gigabytes in size. Second, these files are manipulated with streaming access patterns typical of batch-processing workloads. When reading files, large data segments are retrieved per operation, with successive requests from the same client iterating through a file region sequentially. Similarly, files are also written in a sequential manner. This emphasis on streaming workloads is evident in the design of HDFS. A simple coherence model (write-once, read-many) is used that does not allow data to be modified once written (the latest version of Hadoop does support file appends, but it is a complex design which is beyond the scope of this article). This is well suited to the streaming access pattern of the target applications, and improves cluster scaling by simplifying synchronization requirements. The Hadoop MapReduce programming model follows a shared-nothing architecture; individual Map or Reduce tasks do not share any data structures and thus avoid synchronization and locking issues. Each file in HDFS is divided into large blocks for storage and access, typically 64MB in size. Portions of the file can be stored on different cluster nodes, balancing storage resources and demand. Manipulating data at this granularity is efficient because streaming-style applications are likely to read or write the entire block before moving on to the next.1

HDFS is a user-level file system running on top of an operating system-level file system (e.g. ext3 or ext4 on UNIX). Any read or write to HDFS uses the underlying operating system's support for reading from and writing to raw disk. Raw disk performance is much better for linear reads and writes, while random read and write performance is very poor due to seek time overhead. Linear reads and writes are the most predictable of all usage patterns, and are heavily optimized by the operating system. A modern operating system provides read-ahead and write-behind caching techniques that pre-fetch data in large block multiples and group smaller logical writes into large physical writes. We will discuss these two caching techniques in detail when we explain how Kafka (the real-time distributed messaging platform) uses operating system caching to scale. But does Hadoop really need the operating system's caching support? The answer is no. The OS cache is an overhead for Hadoop, because the sequential access pattern of MapReduce applications has minimal locality that a cache could exploit. Hadoop would perform better if it could bypass the OS cache completely. This is a complex thing to do, as HDFS is written in Java and Java I/O does not support bypassing OS caching.

Hadoop performance is impacted most when data is not accessed sequentially, which may happen due to a poor disk scheduling algorithm or when data gets fragmented on disk during writes.

Various studies on raw disk performance show that reads and writes reach peak bandwidth when the sequential run length (the amount of data scanned sequentially before a random seek occurs) is about 32MB. Thus, an HDFS block size of 64MB is very reasonable.

1 http://www.cs.rice.edu/CS/Architecture/docs/phdthesis-shafer.pdf


However, optimal I/O bandwidth during writes and reads may not always be achieved. Data may get fragmented on disk, leading to poor read performance, or the operating system's scheduling algorithm may cause write bandwidth to decrease, which also leads to fragmentation. Let us examine these two points, disk scheduling and disk fragmentation, in detail and see how Hadoop has solved this problem.

HDFS 0.20.x performance degrades whenever a disk is shared between multiple concurrent writers or readers. Excessive disk seeks occur that are counter-productive to the goal of maximizing overall disk bandwidth. This is a fundamental problem that affects HDFS running on all platforms. Existing operating system I/O schedulers are designed for general-purpose workloads and attempt to share resources fairly between competing processes. In such workloads, storage latency is of equal importance to storage bandwidth; thus, fine-grained fairness is provided at a small granularity (a few hundred kilobytes or less). In contrast, MapReduce applications are almost entirely latency insensitive, and thus should be scheduled to maximize disk bandwidth by handling requests at a large granularity (dozens of megabytes or more). Testing shows that in Hadoop 0.20.x, aggregate bandwidth drops drastically when moving from one writer to two concurrent writers, and drops further as more writers are added.

This performance degradation occurs because the number of seeks grows as the number of writers increases: the I/O scheduler forces the disk to move between distinct data streams. Because of these seeks, the average sequential run length decreases dramatically. In addition to poor I/O scheduling, HDFS also suffers from disk fragmentation when sharing a disk between multiple writers. The maximum possible file contiguity (the size of an HDFS block) is not preserved by the general-purpose file system when making disk allocation decisions.

Similar performance degradation occurs when the number of readers increases. Disk scheduling between multiple read operations degrades the overall read bandwidth. Also, as fragmentation increases (due to disk scheduling during writes), the sequential run length decreases, the number of disk seeks increases, and the read bandwidth drops.

The diagram2 below shows the performance impact on different file systems as the number of readers and writers increases, and also the impact of fragmentation, for Hadoop version 0.20.x.

2 Fig 3.9 http://www.cs.rice.edu/CS/Architecture/docs/phdthesis-shafer.pdf


In this diagram, concurrent writes on Linux exhibited better performance characteristics than FreeBSD. The ext4 file system showed 8% degradation moving between 1 and 4 concurrent writers, while the XFS file system showed no degradation. In contrast, HDFS on Linux had worse performance for concurrent reads than FreeBSD. The ext4 file system degraded by 42% moving from 1 to 4 concurrent readers, and XFS degraded by 43%. Finally, fragmentation was reduced on Linux, as the ext4 file system degraded by 8% and the XFS file system by 6% when a single reader accessed files created by 1 to 4 concurrent writers.

How does the latest version of Hadoop solve this problem of scheduling and fragmentation? The key is to make HDFS smarter and present requests to the operating system in the order in which HDFS wants them processed. Let us elaborate on this point a bit. The fundamental problem of disk scheduling and fragmentation arises because the OS I/O scheduler sees multiple writers/readers issuing write/read calls and applies its scheduling algorithms across those processes.


In the diagram above, there are four clients issuing write/read calls to the operating system. The earlier version of Hadoop spawns a separate thread for every client, four threads in this case. The operating system sees four processes trying to access the disk and applies scheduling (e.g. round robin or time sharing) between them, which leads to poor I/O bandwidth and disk fragmentation.

The diagram below shows how the latest version of Hadoop solves this problem.

[Diagram: four Clients write into an HDFS Buffer, which a single Thread drains to the Operating System and then to the Rotating Disk]

In the diagram above, four clients try to access the disk, but now HDFS buffers the requests and schedules them to disk at a specified granularity (say 64MB) using a single thread. Since only a single thread per disk is reading or writing the buffer, from the operating system's point of view it is just one process accessing the disk. Thus, the OS does not perform any expensive scheduling, leading to less fragmentation and hence higher I/O bandwidth.
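As a rough sketch (not actual HDFS code; the class and method names are invented for illustration), the per-disk buffering idea looks something like this: clients hand fully buffered blocks to a queue, and one dedicated thread drains that queue sequentially to the disk.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/**
 * Illustrative only: one writer thread per physical disk drains a queue of
 * large, pre-buffered requests, so the OS sees a single sequential stream
 * instead of many competing writers.
 */
public class SingleWriterPerDisk implements Runnable {

    private final BlockingQueue<byte[]> pending = new ArrayBlockingQueue<>(16);
    private final FileOutputStream out;

    public SingleWriterPerDisk(String pathOnDisk) throws IOException {
        this.out = new FileOutputStream(pathOnDisk, true); // append-only
        Thread writer = new Thread(this, "disk-writer");
        writer.setDaemon(true);
        writer.start();
    }

    /** Clients enqueue fully buffered blocks (e.g. 64MB); they never touch the disk directly. */
    public void submit(byte[] block) throws InterruptedException {
        pending.put(block);
    }

    @Override
    public void run() {
        try {
            while (true) {
                byte[] block = pending.take();   // one block at a time
                out.write(block);                // sequential write, single stream
                out.flush();
            }
        } catch (InterruptedException | IOException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```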

Various other optimizations have been made in HDFS to make it perform better. In this article, we discussed one issue around disk fragmentation and scheduling. We also touched on the operating system caching mechanism, which is an overhead for the MapReduce access pattern. Various studies have found that Hadoop would perform better if it could bypass the OS cache. In some cases, if Hadoop could bypass the OS file system entirely and access the raw disk directly, it could avoid all of the OS overhead. The complexity and challenges of that kind of approach are a different topic of conversation entirely.

Even though Hadoop is not able to get the best out of the OS cache layer, another Big Data infrastructure, Kafka (https://kafka.apache.org/), has greatly benefited from OS caching support to scale. Kafka is an open source distributed messaging system developed by LinkedIn, which needed to scale its messaging infrastructure to process hundreds of thousands of messages per second. As no existing traditional messaging system was able to meet this kind of high-volume requirement, LinkedIn developed Kafka, which can scale to such high write and read throughput. Kafka is designed to rely heavily on the OS caching mechanism to scale. Let's explore how they made Kafka scale.


Kafka: Scaling Distributed Logs using OS Page Cache

Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design.3 The heart of the Kafka design is the write-ahead log, or transaction log in the traditional database world. Logs have been around almost as long as computers and are at the heart of many distributed data systems and real-time application architectures.

A log is perhaps the simplest possible storage abstraction. It is an append-only, totally-ordered sequence of records ordered by time. Records are appended to the end of the log, and reads happen from left to right. Each entry is assigned a unique sequential log entry number.4
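A minimal in-memory model of this abstraction (purely illustrative; Kafka's real log is partitioned and persisted as segment files on disk) might look like this:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative in-memory model of the log abstraction: an append-only,
 * totally ordered sequence of records, each identified by its offset.
 */
public class AppendOnlyLog {

    private final List<byte[]> records = new ArrayList<>();

    /** Append a record to the tail and return its offset (log entry number). */
    public synchronized long append(byte[] record) {
        records.add(record);
        return records.size() - 1;
    }

    /** Readers pull sequentially from whatever offset they have consumed up to. */
    public synchronized List<byte[]> readFrom(long offset, int maxRecords) {
        int from = (int) Math.min(offset, (long) records.size());
        int to = Math.min(from + maxRecords, records.size());
        return new ArrayList<>(records.subList(from, to));
    }
}
```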

Logs are used in database systems to keep transactions durable and atomic. A database uses the log to write out information about the records it will be modifying before applying the changes to all the other data structures (i.e. tables, indexes, etc.). Since the log is immediately persisted, it is used as the source for restoring all other persistent structures in case of a crash. Over time, log use grew from an implementation detail of ACID to a method for replicating data between databases. It was found that the sequence of changes that happened on the database is exactly what was needed to keep a remote replica database in sync. A log solves two major problems in a distributed system:

1. Ordering changes, by appending log entries strictly in timestamp order
2. Distributing data, to maintain consistent replicas

3 http://kafka.apache.org/documentation.html#persistence
4 http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying


This log can be used to handle data flow between systems. The concept here is that every data source can be modeled as a log stream that records every event. Every subscribing system for a given source reads from this log stream as and when it can. Since the log gives a logical clock for every change, any subscriber can maintain a "point in time" reference from which it consumes the data.

Additionally, the log acts as a buffer that lets data production and data consumption proceed asynchronously. Thus, one can think of a log as a kind of messaging system with durability guarantees and strong ordering semantics.

Let's look at a high-level abstraction of Kafka, which uses this kind of log data structure.

The primary abstraction of Kafka is called a Topic. A topic is a category or feed name to which messages are published. For each topic, the Kafka cluster maintains a partitioned log, as shown in this5 diagram, where a given topic has 3 partitions.

Each partition is an ordered, immutable sequence of messages that is continually appended to, forming a commit log. The messages in each partition are assigned a sequential ID number (the offset) that uniquely identifies each message within the partition. The partitions of the log are distributed over the servers in the Kafka cluster, with each server handling data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance. Kafka brokers maintain the topic partitions and guarantee replication consistency.

5 http://kafka.apache.org/documentation.html

Kafka Producers publish data to the topics of their choice. The producer chooses which message to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load, or according to some semantic partition function (e.g. based on a key in the message). Finally, a Kafka Consumer can consume messages from a given topic using either a traditional queuing or a publish-subscribe mechanism. Of note here is that, unlike traditional messaging brokers, the Kafka broker does not maintain any per-message state, which is clearly an overhead a big data system cannot afford at scale. Instead, it is the duty of Kafka consumers to keep the consumption state by maintaining message offsets. Because the storage layout is sequentially ordered, it is easy for a consumer to consume messages from any given offset.
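As a sketch of what a semantic partition function can look like (this is not Kafka's own partitioner class; the logic is only illustrative), a producer might hash the message key modulo the number of partitions, falling back to round-robin when there is no key:

```java
/**
 * Minimal sketch of a key-based (semantic) partition function: messages with
 * the same key always land in the same partition, preserving per-key ordering.
 * A null key falls back to a simple round-robin assignment.
 */
public class KeyedPartitioner {

    private int roundRobinCounter = 0;

    public int partitionFor(byte[] key, int numPartitions) {
        if (key == null) {
            // no key: spread load evenly across partitions
            roundRobinCounter = (roundRobinCounter + 1) % numPartitions;
            return roundRobinCounter;
        }
        // same key -> same partition; mask off the sign bit before the modulo
        return (java.util.Arrays.hashCode(key) & 0x7fffffff) % numPartitions;
    }
}
```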

However, having a commit log that acts as a multi-subscriber, real-time journal of all events happening in a large-scale system makes scalability a primary challenge. Let us see how Kafka addresses this.

Kafka relies heavily on the file system for storing and caching messages. There is a general perception that "disks are slow", which makes people skeptical that a persistent structure can offer competitive performance. In fact, disks are both much slower and much faster than people expect, depending on how they are used. The key fact about disk performance is that the throughput of hard drives has been diverging from the latency of a disk seek for the last decade. As a result, the performance of linear writes on a six-disk 7200rpm SATA RAID-5 array is about 300MB/sec, while the performance of random writes is only about 50KB/sec, a difference of several thousand times. These linear reads and writes are the most predictable of all usage patterns, and hence the ones detected and optimized best by the operating system using read-ahead and write-behind techniques.6

Kafka is designed around the operating system page cache. In computing, the page cache (sometimes ambiguously called disk cache) is a "transparent" buffer of disk-backed pages kept in main memory (RAM) by the operating system for quicker access. The page cache is typically implemented in kernels with paging memory management and is completely transparent to applications.7 Any modern OS will happily divert all free memory to disk caching, with little performance penalty when the memory is reclaimed. All disk reads and writes go through this unified cache.

6 http://kafka.apache.org/design.html

If your disk usage favors linear reads, then read-ahead effectively pre-populates the page cache with useful data on each disk read. Read-ahead is the file prefetching technique used in the operating system to pre-fetch a few blocks ahead of time into the page cache. When a file is subsequently accessed, its contents are already in the cache, and the read is served from physical memory rather than from disk, which is much faster.

In the case of a write, data is written only to the cache. The write to the backing store (the file system) is postponed until the cache blocks containing the data are about to be modified or replaced by new content. This technique is called write-behind caching. Kafka always immediately writes all data to the OS cache and allows the flush policy, which controls when data is forced out of the OS cache onto disk, to be configured. The flush policy can force data to disk after a period of time or after a certain number of messages has been written, and there are several choices in this configuration.8 Kafka must eventually call fsync to know that data has been flushed. The frequency of this application-level fsync has a large impact on both latency and throughput. Setting a large flush interval improves throughput, as the operating system can buffer many small writes into a single large write. This works effectively even across many partitions all taking simultaneous writes, provided enough memory is available for buffering.
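For concreteness, the broker-side flush settings discussed above look roughly like the following; the property names are taken from the Kafka 0.8-era broker configuration and normally live in the broker's server.properties file, so verify them against the version you run:

```java
import java.util.Properties;

/**
 * Sketch of the flush policy discussed above. In practice these keys are set
 * in the broker's server.properties file; names are as documented for the
 * Kafka 0.8 era, so check them against your Kafka version.
 */
public class FlushPolicyExample {
    public static Properties flushPolicy() {
        Properties props = new Properties();
        // force an fsync after this many messages have been appended to a log
        props.setProperty("log.flush.interval.messages", "10000");
        // ...or after this many milliseconds, whichever comes first
        props.setProperty("log.flush.interval.ms", "1000");
        return props;
    }
}
```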

In Linux, data written to the file system is maintained in the page cache until it must be written out to disk (due to an application-level fsync or the OS's own flush policy). The flushing of data is done by a set of background threads called pdflush (or, in post-2.6.32 kernels, "flusher threads").

Pdflush has a configurable policy that controls how much dirty data can be maintained in the cache, and for how long, before it must be written back to disk. When pdflush cannot keep up with the rate at which data is being written, it will eventually cause the writing process to block, incurring latency in the writes so as to slow down the accumulation of dirty data.

Using page cache has advantages over an in-process cache for storing data that will be written out to disk:

7 http://en.wikipedia.org/wiki/Page_cache
8 http://kafka.apache.org/08/ops.html


 The I/O scheduler will batch together consecutive small writes into bigger physical writes, which improves throughput.
 The I/O scheduler will attempt to re-sequence writes to minimize movement of the disk head, which reduces disk fragmentation and improves throughput.

It is interesting to note that, in the case of HDFS, the overhead of the I/O scheduler was causing performance issues and lowering throughput, whereas Kafka relies on the I/O scheduler to improve its throughput. So far we have discussed how Kafka improves the disk access pattern. Kafka adds a few more optimizations on top of this efficient disk access pattern. First, Kafka batches messages into a "Message Set" that naturally groups messages together; this reduces networking overhead and also fragmentation during message production. The other major optimization is around how data is copied from disk to the network socket for transfer (when a consumer reads messages from the broker). Kafka uses a concept called "zero-copy write", which is again an OS optimization that efficiently transfers data from the page cache directly to the network (NIC buffer), bypassing the copy of the data into user space.
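On the JVM this zero-copy path is typically reached through FileChannel.transferTo, which maps to sendfile on Linux. The sketch below only illustrates the call; the file path and socket address are placeholders, not Kafka's actual layout:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

/**
 * Illustrates the zero-copy transfer path: FileChannel.transferTo lets the
 * kernel move bytes from the page cache straight to the socket (sendfile on
 * Linux), avoiding the copy into user space.
 */
public class ZeroCopySend {
    public static void main(String[] args) throws IOException {
        try (FileChannel log = FileChannel.open(Paths.get("/tmp/segment.log"),
                                                StandardOpenOption.READ);
             SocketChannel consumer = SocketChannel.open(
                     new InetSocketAddress("localhost", 9999))) {
            long position = 0;
            long remaining = log.size();
            while (remaining > 0) {
                // bytes flow kernel-side: page cache -> NIC buffer
                long sent = log.transferTo(position, remaining, consumer);
                position += sent;
                remaining -= sent;
            }
        }
    }
}
```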

We have seen that Kafka's design suits a real-time journaling infrastructure very well, where asynchronous publisher and subscriber systems exchange events in a streaming fashion. However, there are systems which need to store large volumes of documents and perform point queries on that data. A Kafka-style system is not very good for point queries. For that, we need to index the data for faster retrieval. While a relational database is a good fit for point queries because of its efficient indexing, it has a scalability bottleneck: the B-Plus Tree (B-Tree) index which relational databases use cannot scale very well under heavy write load. Let us look at the issues with B-Tree-based indexing and how MongoDB addresses them by keeping the index and data files in memory as memory-mapped files.


MongoDB: Memory Mapped File to Store B-Tree Index and Data Files

To understand the MongoDB storage architecture, let us first try to understand the B-Tree data structure. In computer science, a B-tree is a tree data structure that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic time. The B-tree is a generalization of a binary search tree in that a node can have more than two children.9 In B-trees, internal (non-leaf) nodes can have a variable number of child nodes within some pre-defined range. When data is inserted or removed from a node, its number of child nodes changes. To maintain the pre-defined range, internal nodes may be joined or split. Because a range of child nodes is permitted, B-trees do not need re-balancing as frequently as other self-balancing search trees, but may waste space, since nodes are not entirely full. The lower and upper bounds on the number of child nodes are typically fixed for a particular implementation.

Below is a diagram of a B-Tree of order b; its height, and hence the search cost, is O(log_b N).

The height of the tree determines the cost of searching for a key: e.g. a B-tree of order 101 and height 3 can hold 101^4 − 1 items (approximately 100 million), and any item can be accessed with 3 disk reads (assuming we hold the root in memory). B-trees have two overarching traits that make them ideal for database indexes. First, they facilitate a variety of queries, including exact matches, range conditions, sorting, prefix matching, and index-only queries. Second, they’re able to remain balanced in spite of the addition and removal of keys.
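As a quick back-of-the-envelope check of that figure, here is the maximum capacity of a full B-tree of order m = 101 and height h = 3, using the convention that a node has at most m children and m − 1 keys:

```latex
% Maximum key count for a full B-tree of order m and height h (root at level 0):
% each of the \sum_{i=0}^{h} m^i nodes holds at most m-1 keys.
\[
(m-1)\sum_{i=0}^{h} m^{i} \;=\; (m-1)\,\frac{m^{h+1}-1}{m-1} \;=\; m^{h+1}-1
\;=\; 101^{4}-1 \;\approx\; 1.04\times 10^{8} \quad (m=101,\; h=3).
\]
```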

A major issue with the disk-based B-Tree data structure is that insertion is very slow under a large volume of writes. Even though B-tree operations are O(log N), which is often treated as essentially constant time, this does not hold for disk operations. A disk seek costs about 10ms, and each disk can do only one seek at a time, so parallelism is limited. Hence, even a handful of disk seeks leads to very high overhead. Consider the B-Plus Tree, a variant of the B-Tree: it keeps all records in the leaf nodes and chains the nodes at the leaf level (or at each level) using "sibling" pointers.

9 http://en.wikipedia.org/wiki/B-tree

When a new key is inserted into an already full page of a B-Tree, the page needs to split. The issue is that the newly created pages may not be next to each other on disk. Thus, scanning data (for range queries over a given key range) is a very expensive operation in a B-Tree, as it needs to perform multiple seeks. As a result, B-Tree scaling is limited by the disk seek rate, which is very slow. In contrast, in the case of Kafka, the log-structured, append-only data structure is best suited for range scans (in Kafka, time-range scans) because it scales with the disk transfer rate.

To work around this slow disk access pattern caused by seek overhead, MongoDB stores all of its indexes and data files as memory-mapped files. A memory-mapped file is a segment of virtual memory which has been assigned a direct byte-for-byte correlation with some portion (or the whole) of a file. The primary benefit of memory mapping a file is increased I/O performance, especially for large files. Memory-mapped files are the critical piece of the storage engine in MongoDB. By using memory-mapped files, MongoDB can treat the contents of its data files as if they were in memory. This provides MongoDB with an extremely fast and simple method for accessing and manipulating data10 without worrying about expensive disk seeks.
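MongoDB's storage engine does this with mmap(2) from C++; the Java sketch below (file name and sizes are arbitrary) only illustrates the same OS facility: once a file region is mapped, reads and writes are plain memory accesses and the kernel pages data in and out behind the scenes.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

/**
 * Maps a region of a file into the process's address space, so "disk" access
 * becomes memory access backed by the OS page cache.
 */
public class MemoryMappedExtent {
    public static void main(String[] args) throws IOException {
        try (FileChannel channel = FileChannel.open(Paths.get("/tmp/extent.dat"),
                StandardOpenOption.CREATE, StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            // map a 16MB region of the file into virtual memory
            MappedByteBuffer extent =
                    channel.map(FileChannel.MapMode.READ_WRITE, 0, 16 << 20);

            extent.putLong(0, 42L);          // "write to disk" via memory
            long value = extent.getLong(0);  // a page fault reloads it if evicted
            System.out.println("read back: " + value);
        }
    }
}
```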

The following figure presents how the various components of a MongoDB node (Disks, File System, and RAM) interact to provide access to the database.11

10 http://docs.mongodb.org/manual/faq/storage/
11 http://www.polyspot.com/en/blog/2012/understanding-mongodb-storage/


A “namespace” in MongoDB is the concatenation of the database name and the collection name. Collections are containers for documents that share one or more indexes. A database is a group of collections stored on disk using a single set of data files. MongoDB stores its data in files which are broken into extents, which in turn contain the documents. An extent can grow to a size of 2GB (in reality the process is a bit more complex).

The B-Tree index of a Mongo database is also stored as a file with its own extents. MongoDB creates such extents on demand as the database grows, pre-allocating each whole extent for efficiency reasons. Each extent is actually mapped to a list of disk blocks, each containing part of the data. By pre-allocating extents in slices of up to 2GB, MongoDB aims to reduce fragmentation of the files' blocks on disk.

A main design choice of MongoDB that greatly impacts its performance and operations is the decision to delegate memory management to the underlying OS (similarities can be seen with Kafka, which also delegates memory management to the OS). This is represented in the figure above by solid red lines connecting the mapped disk blocks to the virtual memory pages.

Virtual memory is a representation of the real memory that allows processes to abstract away from the nitty-gritty details of addressing specific regions of RAM. The memory map allocates a segment of this memory space to each of the MongoDB extents, enabling them to read/write data as if it were stored in memory. This virtual memory is then mapped a second time by the OS to physical memory. This is done transparently when processes access virtual memory: if the memory page is not mapped, a page fault occurs and the OS has to find an available page to load the data from disk (or to write to, if this is a newly allocated page purely in RAM).

If there is free memory, the operating system can find the page on disk and load it to memory directly. However, if there is no free memory, the operating system must:

 Find a page in memory that is stale or no longer needed, and write that page to disk.
 Read the requested page from disk and load it into memory.

This process, particularly on an active system, can take a long time, especially in comparison to reading a page that is already in memory. This becomes an issue in MongoDB when the working set (the data used by the program plus the index files) exceeds the physical RAM size: MongoDB performance degrades due to heavy page faults.

As discussed earlier, the B-Tree has a scalability problem under heavy write load because it is bound by the disk seek rate. MongoDB solves that issue to some extent by keeping the working set in memory-mapped files and eliminating expensive disk seeks. However, once the working set exceeds the physical memory size, the scalability issue of the B-Tree resurfaces: heavy page faults lead to multiple disk seeks to bring the necessary pages into memory. Thus, MongoDB performance falls back to the disk seek rate once data grows rapidly.

HBase is another Big Data storage solution. A popular NoSQL columnar store, HBase does not have the B-tree problem, can scale to very large write bandwidth, and can perform efficient range-query- and key-based retrieval. HBase uses another data structure, the Log Structured Merge Tree (LSM Tree), to achieve this. We will now discuss how the LSM Tree works and how HBase uses it to scale to workloads of thousands of writes per second.


HBase: Log Structured Merge Tree – Write Optimized B-Tree

To summarize what we have learned thus far: the B-Tree is an excellent data structure for fast lookups and point queries (MongoDB, relational databases), and log-structured files (Kafka) are an excellent data structure for fast inserts. The primary reason for the poor B-Tree insertion rate is that it is limited by the disk seek rate, whereas log-structured files scale because they work at the disk transfer rate. On the other hand, search is very fast on a B-Tree, as it requires minimal seeks to find a record (O(log N)), whereas a log-structured file can take multiple seeks (O(N)) to find a record. However, the log-structured file is very good for range scans of data, as it again works at the disk transfer rate because data is stored contiguously on disk, whereas B-Tree range scans are poor, as sequential pages may not be in consecutive locations on disk and consequently run at the disk seek rate.

A data structure which takes the best of both solutions would be ideal. Various research has been done on this front, and some excellent papers12 have been published on how to make the B-Tree write optimized.

This diagram shows the optimal tradeoff curve for making the B-Tree as close as possible to a logging system in terms of insertion speed without compromising point queries. The LSM Tree is one way to make a B-Tree write optimized.

The LSM Tree uses both B-Tree and log-type, append-only data structures to get the benefit of both worlds. In computer science, the LSM-tree is a data structure with performance characteristics that make it attractive for providing indexed access to files with high insert volume, such as transactional log data. LSM-trees maintain data in two separate structures, each of which is optimized for its respective underlying storage medium; data is synchronized between the two structures efficiently, in batches.

12 http://www.cs.georgetown.edu/~jfineman/papers/sbtree.pdf

Thus, the LSM-tree is a hybrid data structure composed of two tree-like structures, known as the C0 and C1 components. C0 is smaller and entirely resident in memory, whereas C1 is resident on disk. New records are inserted into the memory-resident C0 component. If the insertion causes the C0 component to exceed a certain size threshold, a contiguous segment of entries is removed from C0 and merged into C1 on disk.
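A toy sketch of the C0/C1 interplay (not HBase, LevelDB, or any real engine; the "disk" component is just another sorted map to keep the example self-contained) could look like this:

```java
import java.util.Map;
import java.util.TreeMap;

/**
 * Minimal LSM-style sketch: writes go to a sorted in-memory component (C0);
 * when it exceeds a threshold, its contents are merged into the sorted
 * "on-disk" component (C1) in one sequential batch.
 */
public class TinyLsmTree {

    private static final int C0_THRESHOLD = 4;

    private TreeMap<String, String> c0 = new TreeMap<>();        // in memory
    private final TreeMap<String, String> c1 = new TreeMap<>();  // stands in for disk

    public void put(String key, String value) {
        c0.put(key, value);
        if (c0.size() >= C0_THRESHOLD) {
            flushAndMerge();
        }
    }

    /** Reads check the newest data first (C0), then fall back to C1. */
    public String get(String key) {
        String v = c0.get(key);
        return (v != null) ? v : c1.get(key);
    }

    /** In a real LSM tree this is a sequential write of a sorted run to disk. */
    private void flushAndMerge() {
        for (Map.Entry<String, String> e : c0.entrySet()) {
            c1.put(e.getKey(), e.getValue());   // batched, ordered merge
        }
        c0 = new TreeMap<>();
    }
}
```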

Although the C1 component is disk resident, frequently referenced page nodes in C1 will remain in memory buffers as usual (buffers not shown), so that popular high level directory nodes of C1 can be counted on to be memory resident.13

As each new row of data is inserted, a log record is first written to a sequential log (write-ahead log) file for durability and fault tolerance. The index entry for the row is then inserted into the memory-resident C0 tree, after which it will in time migrate out to the C1 tree on disk; any search for an index entry looks first in C0 and then in C1. There is a certain amount of latency before entries in the C0 tree migrate out to the disk-resident C1 tree, implying a need to recover index entries that did not make it to disk before a crash (we will not cover the recovery mechanism in this article, except to note that the write-ahead log entries are used for the recovery process). The LSM Tree needs an efficient way to migrate entries out to the C1 tree residing on the lower-cost disk medium. To achieve this, whenever an insert causes the C0 tree to reach a threshold size near the maximum allotted, an ongoing rolling merge process deletes some contiguous segment of entries from the C0 tree and merges it into the C1 tree on disk. HBase uses a similar concept to store and merge the in-memory entries with the on-disk entries, and it maintains multiple on-disk files. In fact, every flush of the in-memory tree to disk creates a new on-disk file (C1, C2, C3 … Ck trees). After a certain threshold, multiple on-disk trees are merged into a completely new tree, and the whole process continues. The diagram below shows the LSM tree with K+1 components.

13 http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.44.2782&rep=rep1&type=pdf

To understand how HBase applies the LSM tree concept, let us first gain a brief understanding of HBase architecture.

Fundamentally, an HBase table is a sparse, multi-dimensional, sorted map. Every table has a key which is basically a combination of rowkey + column family + column qualifier + timestamp. For a given table there is a predefined set of column families, but there can be unlimited column qualifiers. Every value inserted for a key is timestamped.

An HBase table is spread across multiple regions based on key ranges, and every region belongs to a Region Server, a physical node which may contain multiple regions. In the following diagram, a table has 4 regions spread across 3 Region Servers. Every region holds a specific key range (e.g. Region 1 for keys a to d, Region 2 for keys e to h, etc.).
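For orientation, this is roughly how a client addresses a cell with the HBase 0.9x-era Java API; the table name, column family, and qualifier below are made up for illustration, and the server stamps the write with a timestamp:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

/**
 * Sketch of writing and reading a single cell, which is addressed by
 * rowkey + column family + column qualifier (+ timestamp assigned on write).
 * Table, family, and qualifier names are hypothetical.
 */
public class HBaseCellExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "user_events");          // hypothetical table

        Put put = new Put(Bytes.toBytes("user#42"));              // rowkey
        put.add(Bytes.toBytes("d"),                                // column family
                Bytes.toBytes("last_login"),                       // column qualifier
                Bytes.toBytes("2014-01-31T10:15:00Z"));            // value
        table.put(put);                                            // goes to WAL + MemStore

        Get get = new Get(Bytes.toBytes("user#42"));
        Result result = table.get(get);
        byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("last_login"));
        System.out.println(Bytes.toString(value));

        table.close();
    }
}
```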


HBase is a NoSQL database on top of HDFS; thus a Region Server is normally collocated with a DataNode (which holds the HDFS data blocks) of the Hadoop cluster. As seen in the diagram below, a Region Server contains a single journal file called the WAL file, a single Block Cache, and multiple regions (HRegion). A region contains multiple stores (HStore), one for each column family. A store consists of multiple store files (HFile) and a MemStore. Store files and WAL files are persisted in HDFS.


Whether you insert or update a row in HBase, HBase receives the command and persists the change. When a write is made, by default it goes into two places: the write-ahead log (WAL) of the RegionServer, which is similar to a transaction log and is also referred to as the HLog, and the MemStore of the store. As you can see, if a table has 3 column families, there will be 3 MemStores and 3 HStores, and every HStore can have multiple StoreFiles (HFiles) based on the volume of data written for that column family.

Here the MemStore is analogous to the C0 tree of the LSM Tree, and the StoreFiles (HFiles) are analogous to the on-disk trees (C1, C2 … Ck) of the LSM Tree.

The MemStore is a write buffer where HBase accumulates data in memory before a permanent write. Its contents are flushed to disk to form an HFile when the MemStore fills up. It doesn’t write to an existing HFile but instead forms a new file on every flush.

With this understanding, let us now see how the LSM Tree structure is used in this architecture. The WAL entry is mainly used for the fault tolerance and recovery process, so leaving that aside, all writes are also written to the MemStore, which is the in-memory copy of the latest data. The MemStore holds a sorted, in-memory index of the most recent data. When the system has accrued enough updates and the in-memory store starts to fill up, it flushes the sorted records to disk, creating a new store file. The store files are arranged similar to B-trees, but are optimized for sequential disk access, where all nodes are completely filled and stored as either single-page or multi-page blocks.14 The store files created on disk are immutable. Periodically the store files are merged together; this is done by a process called compaction.

This diagram shows how multi-page blocks are merged from the in-memory tree into the next on-disk tree. Merging writes out a new block with the combined result. Eventually, the trees are merged into larger blocks.

14 HBase: The Definitive Guide by Lars George. O’Reilly.


As more flushes take place over time, creating many store files, a background process aggregates the files into larger ones so that disk seeks are limited to only a few store files. All of the store files are always sorted by key, so no reordering is required to fit new keys in between existing ones. Lookups are done in a merging fashion, in which the in-memory store is searched first and the on-disk store files are searched next.

There are many more complex mechanisms in HBase to improve performance, e.g. two types of file compaction (major and minor), region splitting when a region grows past a certain size, and so on. HBase has also augmented the basic LSM tree with concepts like Bloom filters and block caches to speed up the search process. Though these details are beyond the scope of this article, we hope to have explained the fundamental building block of HBase and how it scales.

There are other NoSQL solutions as well which use the LSM tree to scale. One such system is Cassandra, a columnar storage system similar to HBase.

Finally, regarding the disk seek vs. disk transfer comparison, the figures below show that the LSM Tree performs orders of magnitude better than the B-Tree.

Given 10 MB/second transfer bandwidth, 10 milliseconds disk seek time, 100 bytes per entry (10 billion entries) with 10 KB per page (1 billion pages), updating 1% of entries (100,000,000) requires:

 1,000 days with random B-tree updates
 100 days with batched B-tree updates
 1 day with sort and merge (LSM Tree)


Thus, it can safely be concluded that at scale, seek is inefficient compared to transfer and LSM Tree can scale at disk transfer rate.

Up to now, we have discussed a couple of NoSQL solutions, a distributed messaging system, and Hadoop, to understand the fundamental scalability challenges they faced and how those were solved. Let us now look at another part of the Big Data spectrum: big, fast data. There are various use cases where the velocity aspect of data is dominant, i.e. hundreds of thousands of messages flowing as a stream while the system needs to perform computation on the stream data on the fly. A few open source systems have evolved to tackle this high-velocity aspect of big data; the most popular is Storm (http://storm-project.net/). Storm solves one of the most complex problems in real-time streaming systems: guaranteed message processing.

Storm: Efficient Lineage Tracking for Guaranteed Message Processing

Real-time big data processing is a key challenge in the big data space. Many use cases across various sectors present a real-time data problem, i.e. fraud detection in financial services, network outage detection in telecom, real-time recommendations in retail, and many more. Processing real-time stream data is challenging because both the volume and the velocity of the data are huge. Handling the velocity of Big Data is not an easy task. First, the system should be able to collect the data generated by real-time event streams coming in at a rate of hundreds of thousands of events per second. Second, it needs to handle the parallel processing of this data as and when it is being collected. Third, it should perform event correlation using a Complex Event Processing engine to extract meaningful information from this moving stream. These three steps should happen in a fault-tolerant and distributed fashion. The real-time system should exhibit low latency so that the computation can happen very fast, with near real-time response capabilities. Storm is one such platform. Originally developed by BackType, which was later acquired by Twitter, Storm is a real-time, distributed, fault-tolerant continuous computation system.

The core abstraction in Storm is the "stream". A stream is an unbounded sequence of tuples. A Storm topology consumes streams of data and processes those streams, repartitioning the streams between each stage of the computation however needed.15 Storm provides the primitives for transforming one stream into a new stream in a distributed and reliable way.

Storm topologies are a combination of Spouts and Bolts. Spouts are where the data stream is injected into the topology. Bolts process the streams that are piped into them; a Bolt can consume data from Spouts or from other Bolts. Storm takes care of the parallel processing of Spouts and Bolts and of moving data around.

As shown in the diagram above, a Storm topology is a graph of stream transformations where each node is a Spout or a Bolt. Edges in the graph indicate which Bolts are subscribing to which streams. When a Spout or Bolt emits a tuple to a stream, it sends the tuple to every Bolt that subscribed to that stream. Each node in a Storm topology executes in parallel. In a topology, one can specify how much parallelism is wanted for each node, and Storm will spawn that number of threads across the cluster to enable the execution.16
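A minimal wiring sketch using the pre-Apache (backtype.storm) API of that era is shown below; the spout, bolt, and component names are invented for illustration, and the parallelism hints are what control how many executor threads Storm spawns:

```java
import java.util.Map;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

/** Minimal topology sketch: one spout emitting sentences, one bolt printing them. */
public class MinimalTopology {

    public static class SentenceSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        public void nextTuple() {
            collector.emit(new Values("the quick brown fox"));  // unanchored emit
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sentence"));
        }
    }

    public static class PrinterBolt extends BaseRichBolt {
        private OutputCollector collector;

        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
        }

        public void execute(Tuple input) {
            System.out.println(input.getStringByField("sentence"));
            collector.ack(input);
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // terminal bolt: no output stream
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new SentenceSpout(), 2);   // 2 executors
        builder.setBolt("printer", new PrinterBolt(), 4)
               .shuffleGrouping("sentences");                    // 4 executors

        Config conf = new Config();
        conf.setNumWorkers(2);                                   // 2 worker JVMs

        new LocalCluster().submitTopology("minimal", conf, builder.createTopology());
    }
}
```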

Storm distinguishes between three main entities that are used to actually run a topology in a Storm cluster:

1. Worker processes
2. Executors (threads)
3. Tasks

The following diagram17 shows the relationship between these components.

15 http://storm-project.net/
16 https://github.com/nathanmarz/storm/wiki/Tutorial
17 http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/


A worker process executes a subset of a topology and runs in its own JVM. An executor, a thread that is spawned by a worker process and runs within the worker’s JVM, may run one or more tasks for the same component (Spout or Bolt). One thread is used for all of its tasks, which means that tasks run serially on an executor. A task performs the actual data processing and is run within its parent executor’s thread of execution.

Internally, Storm uses very efficient event transfer mechanisms within and between worker processes. Messaging within a worker process is restricted to the same machine/node and is backed by LMAX Disruptor, a high-performance inter-thread messaging library.18 This intra-worker communication is different from Storm's inter-worker communication, which normally takes place across machines and thus over the network; for the latter, Storm uses ZeroMQ. The diagram below shows Storm's inter-process and intra-process communication.

18 http://www.michael-noll.com/blog/2013/06/21/understanding-storm-internal-message-buffers/


Each worker process has a single receive thread that listens in on the worker’s TCP port. Similarly, each worker has a single send thread that is responsible for reading messages from the worker’s transfer queue and sending them over the network to downstream consumers. Each worker process controls one or more executor threads and each executor thread has its own incoming and outgoing queue. As shown above, the worker process runs a dedicated worker receive thread that is responsible for moving incoming messages to the appropriate incoming queue of the worker’s various executor threads. Similarly, each executor has its own dedicated send thread that moves an executor’s outgoing messages from its outgoing queue to the “parent” worker’s transfer queue.

It is clear from the Storm pipeline shown above that all processing happens in memory, without any disk usage. The issue with this is that any machine that goes down in the pipeline will impact a set of worker processes and hence their executors. If a Storm node dies, the messages given to the dead node need to be reprocessed by some other machine under a guaranteed message delivery semantic. Every message delivery system follows one of the delivery guarantee semantics shown below.


 At most once—Messages may be lost but are never redelivered.
 At least once—Messages are never lost but may be redelivered.
 Exactly once—Messages are delivered only once.

Storm performs data processing with an "at least once" delivery guarantee that each message coming off a Spout will be fully processed. But how does Storm know that every message coming out of a Spout has been fully processed? This is very critical for Storm, as millions of messages can come out of a Spout, and storing state for every message and tracking it in disk-based storage is not a viable option. We have discussed that for traditional messaging systems the major scalability bottleneck is storing and managing the state of messages, since message state modification is limited by the disk seek rate, which is very slow. That is why the Kafka broker does not store per-message state.

In a streaming system like Storm, events flow through a chain of processors (Spouts and Bolts) until they reach their destination. Each input event produces a directed graph (tree) of descendant events (its lineage) that ends at the final node. To guarantee reliable data processing, it is necessary to ensure that the entire graph was processed successfully, or otherwise to restart the whole processing in case of failure. Storm does that by storing the processing status of the lineage tree.19 What is meant by a lineage tree (we also call it a tuple tree) that is "fully processed"? A tuple coming off a Spout can trigger thousands of tuples to be created based on it. Storm considers a tuple tree coming off a Spout "fully processed" when the tuple tree has been exhausted and every message in the tree has been processed. A tuple is considered failed when part of its tree of messages fails to be fully processed within a specified timeout.20 Let's examine how this happens in Storm.

A tuple created in a Storm topology is given a random 64-bit ID. Every downstream tuple also knows the ID of the initial Spout tuple (i.e. the root tuple) in whose tuple tree it exists.

19 http://highlyscalable.wordpress.com/2013/08/20/in-stream-big-data-processing/
20 https://github.com/nathanmarz/storm/wiki/Guaranteeing-message-processing


For example:

Let us assume there are two Storm Spouts (Black and Yellow) which emit two tuples. One tuple has ID 01111 and the other has ID 01001. Storm maintains a set of Acker tasks which store the state of each Spout tuple's tuple tree. Suppose we have one Acker task.

When a Bolt acknowledges a tuple, it sends a message to the appropriate Acker task with information about how the tuple tree changed. In particular, it tells the Acker, "I am now completed within the tree for this Spout tuple, and here are the new tuples in the tree that were created by me". When an Acker sees that a tree is fully processed, it sends a message to the Spout task that created the initial tuple, marking the root tuple as fully processed.
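In code, that bookkeeping is driven by anchored emits and acks in the bolt; the sketch below (field names are illustrative) shows a bolt anchoring its output tuples to the input tuple and then acking it:

```java
import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

/**
 * Sketch of a bolt participating in the tuple tree: emitting with the input
 * tuple as the first argument "anchors" the new tuples to the tree, and the
 * final ack tells the Acker this node is done.
 */
public class SplitterBolt extends BaseRichBolt {
    private OutputCollector collector;

    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    public void execute(Tuple input) {
        for (String word : input.getStringByField("sentence").split(" ")) {
            // anchored emit: the new tuple becomes a child of 'input' in the tree
            collector.emit(input, new Values(word));
        }
        // report completion of this node; the Acker XORs the IDs accordingly
        collector.ack(input);
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
```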

Storm also maintains the Task ID of every Spout task which emits a tuple into a tuple tree. In this example, assume the two Spout tasks which emit the two tuples have Task ID 1 (Black Spout) and Task ID 2 (Yellow Spout), respectively. If there are multiple Ackers, Storm maps each tuple ID to a corresponding Acker. As we have a single Acker, it maintains both of the tuple trees generated by the two Spout tuples.

Acker tasks do not track the tree of tuples explicitly. For large tuple trees with tens of thousands of nodes (or more), tracking all the tuple trees could overwhelm the memory used by the Ackers. Instead, the Ackers take a different strategy that only requires a fixed amount of space per Spout tuple (about 20 bytes). This tracking algorithm is the key to how Storm works and is one of its major breakthroughs.

An Acker task stores a map from a Spout tuple id to a pair of values. The first value is the Task ID that created the Spout tuple which is used later to send completion messages. The second value is a 64-bit number called the "ACK Val". The ACK Val is a representation of the state of the entire tuple tree, no matter how big or how small. It is simply the XOR of all tuple IDs that have been created and/or acknowledged in the tree.

The Acker task stores the details as: Spout Tuple ID → [Task ID, ACK Val]


For the two tuples emitted in our example, there will be two map entries, which look like this at this point:

01111 [1, 01111]

01001  [2, 01001]

The ACK Val is the same as the tuple ID, because this is the only tuple created in each tree so far.

To expand this example further, assume the next Black Bolt in the topology accepts tuple 01111 and emits three new tuples, numbered 01100, 10010, and 00010, into the topology. The Black Bolt also acknowledges tuple 01111.

The next Yellow Bolt accepts tuple 01001 and emits 11011. At this point, the Yellow Bolt also acknowledges 01001. This is how it looks now.

With this change in the tuple tree, the map entries need to change. As both Bolts send the new information to the Acker, the Acker makes the following changes to the map entries. The change happens in the ACK Val, which becomes the XOR of the existing ACK Val with the IDs of all tuples created or acknowledged in the tree.

The new map entry is: 01111 → [1, 11100]

[01111 (existing ACK Val) XOR 01111 (acknowledged) XOR 01100 (new) XOR 10010 (new) XOR 00010 (new)]

Similarly, for the Yellow Spout it becomes:

01001 → [2, 11011]

At this point, if all of the Black tuples (or Yellow tuples) are acknowledged by the downstream Bolts, the corresponding ACK Val becomes 00000. This is how the Acker knows that, for a given tuple tree, all the messages have been processed, without storing individual tuple details or even the complete tree. The complete example of how the ACK Val changes as the tuple tree is "fully processed" is shown below.
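The XOR bookkeeping itself is tiny; the toy model below (not Storm's internal code) replays the Black Spout example using 5-bit IDs and shows the ACK Val returning to zero once every created tuple has also been acknowledged:

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Toy model of the Acker bookkeeping: one 64-bit value per spout tuple,
 * XORed with the ID of every tuple created or acknowledged in its tree.
 * The tree is fully processed exactly when the value is 0.
 */
public class AckValDemo {

    private final Map<Long, Long> ackVals = new HashMap<>(); // spout tuple id -> ACK Val

    public void track(long spoutTupleId, long tupleId) {
        ackVals.merge(spoutTupleId, tupleId, (a, b) -> a ^ b);
    }

    public boolean fullyProcessed(long spoutTupleId) {
        return ackVals.getOrDefault(spoutTupleId, 0L) == 0L;
    }

    public static void main(String[] args) {
        AckValDemo acker = new AckValDemo();
        long root = 0b01111;

        acker.track(root, root);                 // spout emits the root tuple
        acker.track(root, root);                 // bolt acks the root...
        acker.track(root, 0b01100);              // ...and creates three children
        acker.track(root, 0b10010);
        acker.track(root, 0b00010);
        System.out.println(acker.fullyProcessed(root));   // false: children pending

        acker.track(root, 0b01100);              // each child is acked downstream
        acker.track(root, 0b10010);
        acker.track(root, 0b00010);
        System.out.println(acker.fullyProcessed(root));   // true: ACK Val back to 0
    }
}
```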


When the ACK Val becomes zero, the Acker notifies the Spout task which generated the root tuple that the tuple is fully processed.

Now that we understand the reliability algorithm, let's look at all the failure cases and see how Storm avoids data loss in each case:

 A tuple isn't acked because the Bolt task died: The Spout tuple IDs at the root of the trees for the failed tuple will time out and be replayed.
 Acker task dies: All the Spout tuples the Acker was tracking will time out and be replayed.
 Spout task dies: The source of the Spout is responsible for replaying the messages. For example, queues like RabbitMQ will place all pending messages back on the queue when a Storm Spout client disconnects.

By tracking the ACK Val of a tuple tree, Storm is able to provide the "at least once" delivery guarantee: when the tuple tree is not fully processed, the Spout replays the whole tree by re-emitting the same root tuple. But "at least once" processing does not guarantee that intermediate nodes in the tuple tree will not process a tuple more than once. This may be a problem for some use cases where we need to calculate a critical count: if part of the downstream tuple tree is re-processed due to a failure, the final count may be wrong. For that, Storm has a concept called Transactional Topology which provides an "exactly once" message delivery guarantee. However, its complexity is beyond the scope of this article.


Storm follows a "record at a time" processing model and keeps the state of the tree to maintain the "at least once" message guarantee semantics. Even though a transactional topology can operate with "exactly once" semantics, it is very slow, as Storm maintains transactional state for every tuple. Newer stream processing platforms are evolving to solve this issue and approach the problem in different ways. One such system is Spark. Though not covered in this article, we recommend that you take a look at Spark to see how it uses a concept called the Resilient Distributed Dataset21 to address stream data processing challenges.

Conclusion

In this article, we covered a few popular open source projects in the Big Data space and dug into their core design principles to understand how they solve very specific problems using fundamental concepts of data structures, optimization of storage access patterns, and operating system principles to achieve horizontal scale. This knowledge is very important for understanding how these systems scale and what their limitations are.

Big Data technology has become a very innovative space, and many products are evolving that claim to solve various key challenges. To evaluate Big Data technology products, it is essential to understand their internal architecture, the problem they claim to solve, and how they solve it. This article detailed a few of these products to help the reader understand whether a given product can solve their Big Data challenges.

EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.

THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” EMC CORPORATION MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Use, copying, and distribution of any EMC software described in this publication requires an applicable software license.

21 http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
