Data Structures of Big Data: How They Scale
Dibyendu Bhattacharya, Principal Technologist, EMC ([email protected])
Manidipa Mitra, Principal Software Engineer, EMC ([email protected])

Table of Contents

Introduction
Hadoop: Optimization for Local Storage Performance
Kafka: Scaling Distributed Logs using OS Page Cache
MongoDB: Memory Mapped File to Store B-Tree Index and Data Files
HBase: Log Structured Merged Tree – Write Optimized B-Tree
Storm: Efficient Lineage Tracking for Guaranteed Message Processing
Conclusion

Disclaimer: The views, processes, or methodologies published in this article are those of the authors. They do not necessarily reflect EMC Corporation's views, processes, or methodologies.

Introduction

We are living in the data age. Big data—information of extreme volume, diversity, and complexity—is everywhere. Enterprises, organizations, and institutions are beginning to recognize that this huge volume of data can potentially deliver high value to their business. The explosion of data has led to significant innovation in various technologies, all revolving around how this huge volume of data can be captured, stored, and processed to extract meaningful insights that help them make better decisions much faster, perform predictions of various outcomes, and so on.

The Big Data technology revolution can be broadly categorized into the following areas:

- Technologies around Batch Processing of Big Data (Hadoop, Hive, Pig, Shark, etc.)
- Technologies around Big Data Messaging infrastructure (Kafka)
- Big Data Databases: NoSQL Technologies (HBase, Cassandra, MongoDB, etc.)
- Technologies around Real-Time Processing of Big Fast Data (Storm, Spark, etc.)
- Big Data Search Technologies (ElasticSearch, SolrCloud, etc.)
- Massively Parallel Processing (MPP) Technologies (HAWQ, Impala, Drill, etc.)

For each of these broad categories of Big Data technologies, various products or solutions have either already matured or started evolving. In this article, we discuss a few of the most popular and prominent open source technologies and explain how efficiently each solution has applied data structure concepts and computer science principles to solve very complex problems. The article highlights one key Big Data challenge that each of these solutions had to address and shows how it tackles that challenge using fundamentals of data structures and operating system concepts. This will help the reader understand the major scalability concerns that big data brings to the table and how to build a system designed to scale.

Let us first look at Hadoop, the most popular Big Data batch processing system, and see how it is optimized for local storage performance.

Hadoop: Optimization for Local Storage Performance

Hadoop has become synonymous with big data processing technologies. The most sought-after technology for distributed parallel computing, Hadoop solves a specific problem in the big data spectrum: it is a highly scalable, fault-tolerant, distributed big data processing system designed to run on commodity hardware. Hadoop consists of two major components:

1. Hadoop Distributed File System (HDFS), a fault-tolerant, highly consistent distributed file system
2. The MapReduce engine, which is essentially a parallel programming framework on top of the distributed file system

The Hadoop framework is highly optimized for batch processing of huge amounts of data, where large files are stored across multiple machines. HDFS is implemented as a user-level file system in Java that exploits the native file system on each node, such as ext3 or NTFS, to store data. Files in HDFS are divided into large blocks, typically 64 MB, and each block is stored as a separate file in the local file system. HDFS is implemented by two services, the NameNode and the DataNode: the NameNode is the master daemon that manages the file system metadata, while the DataNode, the slave daemon, stores the actual data blocks. The MapReduce engine is likewise implemented by two services, the JobTracker and the TaskTracker: the JobTracker is the master daemon that schedules and monitors distributed jobs, and the TaskTracker is the slave daemon that executes the individual tasks.
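To make this division of labor concrete, the short sketch below shows how a client application typically streams data to and from HDFS through the standard org.apache.hadoop.fs.FileSystem API. It is only an illustrative sketch, not code from this article: the cluster address, configuration key, file path, and record contents are hypothetical. The client only issues metadata requests to the NameNode; the library streams the actual blocks to and from DataNodes on its behalf.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsStreamingSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical cluster address; real deployments usually pick this up from core-site.xml.
            conf.set("fs.defaultFS", "hdfs://namenode:8020");

            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/data/events/sample.log");   // hypothetical path

            // Write-once: the file is created, written sequentially, and closed.
            try (FSDataOutputStream out = fs.create(path)) {
                out.writeBytes("sample record\n");
            }

            // Read-many: the client iterates through the file sequentially, block by block.
            try (FSDataInputStream in = fs.open(path);
                 BufferedReader reader = new BufferedReader(
                         new InputStreamReader(in, StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }

Notice that the write path only appends bytes to the end of a stream and the read path only scans forward; there is no call for updating data in place, which reflects the write-once, read-many coherence model discussed next.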
Hadoop MapReduce applications use storage in a manner that differs from general-purpose computing. First, the data files accessed are large, typically tens to hundreds of gigabytes in size. Second, these files are manipulated with the streaming access patterns typical of batch-processing workloads. When reading files, large data segments are retrieved per operation, with successive requests from the same client iterating through a file region sequentially. Similarly, files are written in a sequential manner. This emphasis on streaming workloads is evident in the design of HDFS. A simple coherence model (write-once, read-many) is used that does not allow data to be modified once written (the latest versions of Hadoop do support file appends, but that is a complex design which is beyond the scope of this article). This model is well suited to the streaming access pattern of the target applications and improves cluster scaling by simplifying synchronization requirements. The Hadoop MapReduce programming model follows a shared-nothing architecture: individual Map or Reduce tasks do not share any data structures, which avoids synchronization and locking issues.

Each file in HDFS is divided into large blocks for storage and access, typically 64 MB in size. Portions of the file can be stored on different cluster nodes, balancing storage resources and demand. Manipulating data at this granularity is efficient because streaming-style applications are likely to read or write the entire block before moving on to the next.[1]

HDFS is a user-level file system running on top of an operating system-level file system (e.g. ext3 or ext4 on UNIX-like systems). Any read or write to HDFS relies on the underlying operating system's support for writing to and reading from raw disk. Raw disk performance is much better for linear reads and writes, while random read and write performance is very poor due to seek-time overhead. These linear reads and writes are the most predictable of all usage patterns and are heavily optimized by the operating system.
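The gap between linear and random disk access that motivates this design can be seen with a small, self-contained microbenchmark. The sketch below is not from the article; it is a hypothetical Java experiment that reads the same total volume of data from a local test file twice, once sequentially and once at random offsets, so the seek-time overhead shows up directly in the timings. The file name and chunk size are made up, and the file is assumed to be much larger than one chunk.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;
    import java.util.Random;

    public class SeqVsRandomRead {
        private static final int CHUNK = 4 * 1024 * 1024;   // 4 MB per read request

        public static void main(String[] args) throws IOException {
            // Hypothetical test file; it should be large enough to make caching effects negligible.
            String fileName = args.length > 0 ? args[0] : "testfile.bin";

            try (FileChannel ch = FileChannel.open(Paths.get(fileName), StandardOpenOption.READ)) {
                long size = ch.size();
                ByteBuffer buf = ByteBuffer.allocateDirect(CHUNK);

                // Sequential pass: one long run of adjacent reads, the pattern HDFS is built around.
                long t0 = System.nanoTime();
                ch.position(0);
                while (ch.read(buf) > 0) {
                    buf.clear();
                }
                long sequentialMs = (System.nanoTime() - t0) / 1_000_000;

                // Random pass: the same amount of data, but every read jumps to a random offset first.
                Random rnd = new Random(42);
                long reads = size / CHUNK;
                long t1 = System.nanoTime();
                for (long i = 0; i < reads; i++) {
                    long offset = (long) (rnd.nextDouble() * (size - CHUNK));
                    buf.clear();
                    ch.read(buf, offset);   // positional read forces a seek on a cold cache
                }
                long randomMs = (System.nanoTime() - t1) / 1_000_000;

                System.out.printf("sequential: %d ms, random: %d ms%n", sequentialMs, randomMs);
            }
        }
    }

On a cold cache and spinning disks the random pass is typically far slower; on a repeated run, however, the operating system's page cache can hide much of the difference, which leads directly into the read-ahead and write-behind behavior discussed next.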
A modern operating system provides read-ahead and write-behind caching techniques that pre-fetch data in large block multiples and group smaller logical writes into large physical writes. We will discuss these two caching techniques in detail when we explain how Kafka (the real-time distributed messaging platform) uses the operating system's internal caching to scale. But does Hadoop really need the operating system's caching support? The answer is no. The OS cache is an overhead for Hadoop, because the sequential access pattern of MapReduce applications has minimal locality that a cache could exploit. Hadoop would perform better if it could bypass the OS cache completely, but this is difficult to do because HDFS is written in Java, and Java I/O does not support bypassing OS caching.

Hadoop performance is impacted most when data is not accessed sequentially, which may happen due to a poor disk scheduling algorithm or when data becomes fragmented on disk during writes. Various studies of raw disk performance show that reads and writes reach peak bandwidth when the sequential run length (the amount of data scanned sequentially before a random seek occurs) is 32 MB. Thus, keeping an HDFS block size of 64 MB is very reasonable.

However, optimized I/O bandwidth during writes and reads may not always be achieved. Data may become fragmented on disk, leading to poor read performance, or the operating system's scheduling algorithm may cause write bandwidth to decrease, which also leads to fragmentation. Let us examine these two points, disk scheduling and disk fragmentation, in detail and see how Hadoop has solved this problem.

HDFS 0.20.x performance degrades whenever the disk is shared between multiple concurrent writers or readers. Excessive disk seeks occur that are counter-productive to the goal of maximizing overall disk bandwidth. This is a fundamental problem that affects HDFS running on all platforms. Existing operating system I/O schedulers are designed for general-purpose workloads and attempt to share resources fairly between competing processes. In such workloads, storage latency is of equal importance to storage bandwidth; thus, fine-grained fairness is provided at a small granularity (a few hundred kilobytes or less). In contrast, MapReduce applications are almost entirely latency insensitive, and thus should be scheduled to maximize disk bandwidth by handling requests at a large granularity (dozens of megabytes or more). It is found during testing that in the Hadoop 0.20.x version, aggregate bandwidth

[1] http://www.cs.rice.edu/CS/Architecture/docs/phdthesis-shafer.pdf