An Evaluation of Cassandra for Hadoop

Elif Dede, Bedri Sendir, Pinar Kuzlu, Jessica Hartog, Madhusudhan Govindaraju
Grid and Cloud Computing Research Laboratory
SUNY Binghamton, New York, USA
Email: {edede1, bsendir1, pkuzlu1, jhartog1, mgovinda}@binghamton.edu

Abstract—In the last decade, the increased use and growth of social media, unconventional web technologies, and mobile applications have all encouraged the development of a new breed of data models. NoSQL data stores target unstructured data, which by nature is dynamic and a key focus area for “Big Data” research. New generation data can prove costly and impractical to administer with SQL due to its lack of structure and its high scalability and elasticity needs. NoSQL data stores such as MongoDB and Cassandra provide a desirable platform for fast and efficient data queries. This leads to increased importance in areas such as cloud applications, e-commerce, social media, bio-informatics, and materials science. In an effort to combine the querying capabilities of conventional database systems and the processing power of the MapReduce model, this paper presents a thorough evaluation of the Cassandra NoSQL database when used in conjunction with the Hadoop MapReduce engine. We characterize the performance for a wide range of representative use cases, and then compare, contrast, and evaluate so that application developers can make informed decisions based upon data size, cluster size, replication factor, and partitioning strategy to meet their performance needs.

(This work was supported in part by NSF grant CNS-0958501.)

I. INTRODUCTION

With the advent of the “Big Data” era, the size and structure of data have become highly dynamic. As application developers deal with a deluge of data from various sources, they face challenges caused by the data's lack of structure and schema. As such data grows and is constantly modified via social media, news feeds, and scientific sensor input, the requirements placed on storage models have also changed. As the unstructured nature of the data limits the applicability of the traditional SQL model, NoSQL has emerged as an alternative paradigm for this new non-relational data schema. NoSQL frameworks, such as DynamoDB [1], MongoDB [6], BigTable [10] and Cassandra [23], address this “Big Data” challenge by providing horizontal scalability. This, unlike the vertical scalability scheme of traditional databases, results in lower maintenance costs.

While the NoSQL model provides an easy and intuitive way to store unstructured data, the performance under operations common in cloud applications is not well understood. The MapReduce model has evolved as the paradigm of choice for “Big Data” processing. However, studies with performance insights on the applicability of the MapReduce model to NoSQL offshoots, such as MongoDB and Cassandra, have been lacking. As “modern” data is increasingly produced from various sources, it becomes increasingly unstructured while continually growing in size with user interaction; it is therefore important to evaluate the NoSQL model when used with the MapReduce processing paradigm. In this paper, we analyze various considerations when using Cassandra as the data store for Hadoop MapReduce processing. Cassandra is an open source, non-relational, column oriented database used for storing large amounts of unstructured data. Apache Hadoop is a well-known platform for data intensive processing.

We first identify and analyze various aspects of Cassandra, and by extension NoSQL object stores, such as locality, scalability, data distribution, load balancing, and I/O performance. We provide insights on the strengths and pitfalls of using Cassandra as the underlying storage model with Apache Hadoop for typical application loads in a cloud environment. We also present and analyze the performance data of running the same experiments using Hadoop native, which uses the Hadoop Distributed File System (HDFS) for storage. The contributions of this paper are as follows:

• Identify and describe the key NoSQL features required for efficient performance with Hadoop.
• Discuss how the various features of Cassandra, such as replication and data partitioning, affect Apache Hadoop's performance.
• Analyze the performance implications of running Hadoop with Cassandra as the underlying data store. Verify performance gains and losses by processing application data typical in cloud environments, as well as classical I/O and memory intensive workloads.
II. BACKGROUND

A. MapReduce and Hadoop

The MapReduce model proposes splitting a data set to enable its processing in parallel over a cluster of commodity machines, which are called workers. Input distribution, scheduling, parallelization and machine failures are all handled by the framework itself and monitored by a node called the master. The idea is to split parallel execution into two phases: map and reduce. Map processes a key and produces a set of intermediate key/value pairs. Reduce uses the intermediate results to construct the final output.

Apache Hadoop [2] is the most popular open source implementation of the model. Hadoop consists of two core components: the Hadoop MapReduce framework and the Hadoop Distributed File System (HDFS) [27]. Hadoop MapReduce consists of a JobTracker that runs on the master node and TaskTrackers running on each of the workers. The JobTracker is responsible for determining job specifications (i.e., number of mappers, etc.), submitting the user job to the cluster, and monitoring the workers and the job status. The TaskTrackers execute the user specified map or reduce tasks. Hadoop relies on HDFS for data distribution and input management. It automatically breaks data into chunks and spreads them over the cluster. Nodes hosting the input splits and replicas are called DataNodes. Each TaskTracker processes the input chunk hosted by the local DataNode. This is done to leverage data locality. The input splits are replicated among the DataNodes based on a user-set replication-factor. This design prevents data loss and helps with fault tolerance in case of node failures.
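As a concrete illustration of the two phases, the following is a minimal Hadoop job sketch in the spirit of the canonical word-count example; it is not one of the workloads evaluated in this paper, and the class and job names are ours. The mapper emits an intermediate key/value pair per token of each input record, and the reducer aggregates the intermediate values per key into the final output.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RecordCount {

  // Map phase: called once per input record, emits (token, 1) pairs.
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        word.set(token);
        context.write(word, ONE);   // intermediate key/value pair
      }
    }
  }

  // Reduce phase: aggregates the intermediate values for each key.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));  // final output record
    }
  }

  public static void main(String[] args) throws Exception {
    // Hadoop 2.x style job construction; older releases use "new Job(conf, name)".
    Job job = Job.getInstance(new Configuration(), "record-count");
    job.setJarByClass(RecordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input splits in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output written to HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The JobTracker schedules one map task per input split and, where possible, places it on the TaskTracker whose local DataNode hosts that split, which is the data locality property examined throughout this paper.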
B. YCSB Benchmarking Suite

The Yahoo! Cloud Serving Benchmark (YCSB) [12] has been developed by Yahoo! engineers to analyze different data stores under several workloads. YCSB is an open source project that features benchmarks for many NoSQL technologies like HBase [3], MongoDB [6], PNUTS [11], and Cassandra [23]. The YCSB Core Package features five basic workloads and each can be extended to produce new ones. In Section III-A, we present performance results for running the YCSB benchmark's Workload C on Cassandra under various scenarios. When executed in load mode, this workload inserts a user-specified number of randomly generated records into the Cassandra database. Each record contains a randomly generated key and 10 fields of 100 bytes each. In order to compare Cassandra performance under different configurations and loads, we use the YCSB default "insertorder", which inserts the records in hashed order of keys. In run mode, the user-specified number of records are read. Each record is read as a whole without specifying any columns.

C. Cassandra

Cassandra [23], originally developed at Facebook, is an open source, non-relational, column oriented, distributed database, developed for storing large amounts of unstructured data over commodity servers. Cassandra uses a peer-to-peer model, which makes it tolerant against single points of failure and provides horizontal scalability. A Cassandra cluster can be expanded on demand by simply starting new servers. These servers only need to know the address of the node to contact to get the start-up information.

1) Data Model: Figure 1 shows the column oriented data model of Cassandra. A column is the smallest component of data and it is a tuple of name, value and timestamp. Timestamps are used for conflict resolution as multiple versions of the same record may be present. Columns associated with a certain key can be depicted as a row; rows do not have a pre-determined structure as each of them may contain several columns. A column family is a collection of rows, like a table in a relational database. Column families are stored in separate files, which are sorted in row key order. The placement of rows on the nodes of a Cassandra cluster depends on the row key and the partitioning strategy. Keyspaces are containers for column families, just as databases contain tables in RDBMSs.

Fig. 1. Hierarchy of the Cassandra data model. A Cassandra cluster can host many keyspaces, which can contain various column families. These contain rows, which are further composed of columns. The number of columns and even the columns themselves can vary between rows.
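A common way to picture this hierarchy is as a set of nested sorted maps: a keyspace maps column family names to column families, a column family maps row keys to rows, and a row maps column names to (name, value, timestamp) columns. The sketch below is only an in-memory analogy under that reading, not Cassandra's actual API; all type and field names are ours.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative in-memory analogy of Cassandra's data model; not Cassandra's API.
public class DataModelSketch {

  // A column is the smallest unit: a (name, value, timestamp) tuple.
  static final class Column {
    final String name;
    final byte[] value;
    final long timestamp;   // used for conflict resolution between versions

    Column(String name, byte[] value, long timestamp) {
      this.name = name;
      this.value = value;
      this.timestamp = timestamp;
    }
  }

  // keyspace -> column family -> row key -> column name -> column.
  // Rows are sorted by row key, mirroring the on-disk row key order;
  // two rows in the same column family need not share the same columns.
  static final class Keyspace {
    final SortedMap<String, SortedMap<String, SortedMap<String, Column>>> columnFamilies =
        new TreeMap<>();

    void put(String columnFamily, String rowKey, Column column) {
      SortedMap<String, SortedMap<String, Column>> cf =
          columnFamilies.computeIfAbsent(columnFamily, k -> new TreeMap<>());
      SortedMap<String, Column> row = cf.computeIfAbsent(rowKey, k -> new TreeMap<>());
      row.put(column.name, column);
    }
  }

  public static void main(String[] args) {
    Keyspace ks = new Keyspace();
    // Two rows in the same column family with different column sets.
    ks.put("users", "user1", new Column("email", "a@example.com".getBytes(), 1L));
    ks.put("users", "user2", new Column("city", "Binghamton".getBytes(), 2L));
    System.out.println(ks.columnFamilies.get("users").keySet());  // [user1, user2]
  }
}
```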
2) Replication and Consistency: Cassandra has automatic replication to duplicate records throughout the cluster according to a user-set replication-factor. This ensures that failing nodes do not result in data loss. Cassandra offers configurable consistency, which provides the flexibility to consciously make trade-offs between latency and consistency. For each read and write request, users choose one of the pre-defined consistency levels: ZERO, ONE, QUORUM, ALL or ANY [23]. In the experiments of Section III-A we use the consistency level ONE. For writes, this means that a request is considered done when at least one server returns success in writing the entry to its commit log. For reads, level ONE means that consulting only one replica node is sufficient to answer the client request.

3) Data Partitioning: Cassandra offers two main partitioning strategies: RandomPartitioner and ByteOrderedPartitioner [19]. The former is the default strategy and is recommended for use in most cases. A hashing algorithm is used to create an md5 hash value of the row key. Each Cassandra node has a token value that specifies the range of keys for which it is responsible. A row is placed in the cluster based on the hash and the range tokens. Distributing the records evenly throughout the cluster balances the load by spreading out client requests. The ByteOrderedPartitioner simply orders rows lexically by key, so it may not distribute data evenly.
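A toy simulation makes the difference between the two strategies visible: RandomPartitioner derives a ring token from the MD5 hash of the row key, so keys scatter roughly uniformly, while ByteOrderedPartitioner places rows by the raw key bytes, so lexically close keys (for example, sequentially numbered keys) land on the same node. The sketch below assumes eight nodes with equal token ranges and is not Cassandra's partitioner code.

```java
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Toy simulation of row placement under the two partitioning strategies.
public class PartitionerSketch {

  static final int NODES = 8;   // assumed cluster size with equal token ranges

  // RandomPartitioner-style: node chosen from the MD5 hash of the row key,
  // so keys scatter (roughly) uniformly regardless of their literal value.
  static int randomPartitionerNode(String rowKey) throws NoSuchAlgorithmException {
    byte[] digest = MessageDigest.getInstance("MD5").digest(rowKey.getBytes());
    BigInteger token = new BigInteger(1, digest);            // 128-bit token
    return token.mod(BigInteger.valueOf(NODES)).intValue();
  }

  // ByteOrderedPartitioner-style: node chosen from the raw key bytes,
  // so sequential or similarly prefixed keys pile up on the same node.
  static int byteOrderedNode(String rowKey) {
    int firstByte = rowKey.isEmpty() ? 0 : (rowKey.getBytes()[0] & 0xFF);
    return firstByte * NODES / 256;   // key space split into contiguous slices
  }

  public static void main(String[] args) throws NoSuchAlgorithmException {
    for (String key : new String[] {"user0001", "user0002", "user0003", "user0004"}) {
      System.out.printf("%s -> random: node %d, byte-ordered: node %d%n",
          key, randomPartitionerNode(key), byteOrderedNode(key));
    }
  }
}
```

Running the sketch on sequential keys such as user0001 through user0004 sends all of them to the same node under the byte-ordered scheme while scattering them under hashing, which mirrors the imbalance measured in Section III-A.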

4) Read and Write: A client can contact any Cassandra node for any operation. The node being connected to serves as a coordinator. The coordinator forwards the client request to the replica node(s) owning the data being claimed.

For each write request, first a commit log entry is created. Then, the mutated columns are written to an in-memory structure called a Memtable. A Memtable, upon reaching its size limit, is committed to disk as a new SSTable. This operation is executed as a background process. An SSTable is an immutable structure, i.e., any updates to a column committed to an SSTable are not reflected in the existing SSTable but are eventually committed to a new one. A write request is sent to all replica nodes, but the consistency level determines the number of nodes whose responses are needed before the transaction can be considered complete.

For a read request, the coordinator contacts the replica nodes specified by the consistency level. This is a digest call to get an md5 hash of all column names, values and timestamps. If replicas are inconsistent, the out-of-date replicas are repaired in the background: a full data request is sent out and, after comparing timestamps, the most recent version is forwarded to the client.

Cassandra is optimized for large volumes of writes as each write request is treated like an in-memory operation; the I/O is executed as a background process. For reads, first the versions of the record are collected from all Memtables and SSTables. Later, consistency checks and read repair calls are performed. Keeping the consistency level low makes read operations faster as fewer replicas are checked before returning the call. However, read repair calls to each replica still happen in the background. Thus, a record with more replicas means more read repair calls. In Section III-A, we quantify the performance of writes and the penalty for the I/O prone reads.
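The write and read paths above can be condensed into a small sketch: every write is appended to a commit log, applied to an in-memory Memtable, and eventually flushed to an immutable SSTable, while a read may have to consult the Memtable and several SSTables. This is a conceptual illustration under our own simplifications (single node, no replication, string values), not Cassandra's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Conceptual illustration of Cassandra's write path; not the real implementation.
public class WritePathSketch {

  static final int MEMTABLE_LIMIT = 4;                  // toy flush threshold (rows)

  final List<String> commitLog = new ArrayList<>();     // durable, append-only log
  SortedMap<String, String> memtable = new TreeMap<>(); // in-memory, sorted by row key
  final List<SortedMap<String, String>> sstables = new ArrayList<>(); // immutable "on-disk" tables

  void write(String rowKey, String value) {
    commitLog.add(rowKey + "=" + value);  // 1. append to the commit log first
    memtable.put(rowKey, value);          // 2. mutate the in-memory Memtable
    if (memtable.size() >= MEMTABLE_LIMIT) {
      flush();                            // 3. in Cassandra this runs in the background
    }
  }

  void flush() {
    sstables.add(memtable);               // the Memtable becomes a new immutable SSTable
    memtable = new TreeMap<>();           // updates never rewrite an existing SSTable
  }

  // A read may consult the Memtable and every SSTable, newest first,
  // which is one reason reads are costlier than writes.
  String read(String rowKey) {
    String result = memtable.get(rowKey);
    for (int i = sstables.size() - 1; i >= 0 && result == null; i--) {
      result = sstables.get(i).get(rowKey);
    }
    return result;
  }

  public static void main(String[] args) {
    WritePathSketch store = new WritePathSketch();
    for (int i = 0; i < 6; i++) {
      store.write("row" + i, "value" + i);
    }
    System.out.println(store.read("row2") + ", sstables=" + store.sstables.size());
  }
}
```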

5) Hadoop Support: Cassandra and Hadoop integration is important as it facilitates data management and real-time analysis along with complex data intensive processing. In a co-located Hadoop-Cassandra cluster, Cassandra servers are overlapped with Hadoop TaskTrackers and DataNodes. This is to ensure data locality; each TaskTracker processes the data that is stored in the local Cassandra node. The DataNodes are still required because Hadoop needs HDFS for copying the dependency jars, static data, and intermediary data.
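In such a co-located deployment, a Hadoop job declares a Cassandra column family as its input (and, for the Hadoop-Cassandra-Cassandra configuration evaluated later, as its output) instead of an HDFS path. The sketch below is modeled on the word-count sample distributed with Cassandra 1.x, using ColumnFamilyInputFormat and ConfigHelper; the keyspace and column family names are placeholders, and the exact helper methods vary across Cassandra versions.

```java
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ColumnFamilyOutputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch of a job that reads its input splits from a co-located Cassandra column
// family and writes results back to Cassandra. Modeled on the word-count sample
// shipped with Cassandra 1.x; names are placeholders and method names may differ
// between Cassandra releases.
public class CassandraJobSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "hadoop-cassandra-cassandra");
    job.setJarByClass(CassandraJobSketch.class);
    // job.setMapperClass(...); job.setReducerClass(...);   // application-specific

    Configuration conf = job.getConfiguration();

    // Input: a Cassandra column family instead of an HDFS path. Each map task is
    // assigned the token range owned by its local Cassandra server, preserving locality.
    job.setInputFormatClass(ColumnFamilyInputFormat.class);
    ConfigHelper.setInputInitialAddress(conf, "localhost");       // any cluster node
    ConfigHelper.setInputRpcPort(conf, "9160");                   // Thrift port (1.x default)
    ConfigHelper.setInputPartitioner(conf, "org.apache.cassandra.dht.RandomPartitioner");
    ConfigHelper.setInputColumnFamily(conf, "Keyspace1", "InputRecords");
    // ConfigHelper.setInputSlicePredicate(conf, ...);  // selects which columns each map task reads

    // Output: write the result set back into another column family.
    job.setOutputFormatClass(ColumnFamilyOutputFormat.class);
    ConfigHelper.setOutputInitialAddress(conf, "localhost");
    ConfigHelper.setOutputRpcPort(conf, "9160");
    ConfigHelper.setOutputPartitioner(conf, "org.apache.cassandra.dht.RandomPartitioner");
    ConfigHelper.setOutputColumnFamily(conf, "Keyspace1", "OutputRecords");

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```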

III. PERFORMANCE RESULTS

In this section we evaluate the performance of Cassandra and Hadoop under different scenarios, including the YCSB Benchmark's Workload C.

The following experiments were performed on the Grid and Cloud Computing Research Lab Cluster at Binghamton University:

• 8 nodes in a cluster, each of which has two 2.6 GHz Intel Xeon CPUs, 8 GB of RAM, and 8 cores, and runs a 64-bit version of Linux 2.6.15.

A. Cassandra Preliminary Studies

We focus on data reads and writes since the interaction of the MapReduce model with the data is limited to these two. The model reads the input data and writes the output data as a new set of records. The input data is not modified.

1) Data Locality: We are interested in data locality experiments as it is a core facet of the MapReduce model. Typically, data locality is accomplished by placing the input splits on the local disks of computing nodes. However, when using a distributed database a record might be sitting on any node in the cluster. In order to force data locality, the processing tasks of a compute node should be chosen based on the range of records stored in the local database instance.

We examine the performance for reading and writing records locally versus remotely on a single Cassandra server. Figure 2 shows processing times for increasing read and write sizes with a single Cassandra server and a single YCSB client. As explained in Section II-C, Cassandra is optimized for writes and the superiority over reads is observed both in local and remote cases. While writing 0.25 million records locally is 1.6 times faster, as the data size reaches 4 million it becomes 1.8 times faster. The reads, on the other hand, are up to 1.3 times faster when done locally. The performance gap grows with increasing data sizes. In the case of MapReduce implementations, multiple mappers run simultaneously and lack of data locality results in constant exchange of data between nodes. In such cases, the small performance loss observed here with a single node will be compounded for each worker. This in turn will cause the overall performance to suffer.

Fig. 2. Reading/writing data with a YCSB client from/to a local versus remote Cassandra instance. Writing 0.25 million records locally is 1.6 times faster while writing 4 million is 1.8 times faster. Reading data locally is up to 1.3 times faster with the increasing data size.

2) Scaling up the Cluster: In Figure 3, we show the effect of scaling up the Cassandra cluster size for reads and writes as we keep the data size constant. We use 16 and 32 million records and show results in separate sections of the same graph. The number of servers is varied from 2 to 8 for both data sizes and a single YCSB client with 32 threads is used to create the workload. Figure 3 shows that reads are more sensitive to the cluster size than writes. While reading 32 million records is 13 times faster in an 8 node cluster than in a 2 node one, writes get 1.3 times faster for the same records. This behavior for writes is due to the in-memory optimizations described in Section II-C.

We use two different data volumes and show how changing cluster sizes affects each of them. For 16 million records, 2 servers are expectedly slower than 4, 6 and 8. The same behavior is also observed for 32 million records, although in this case there is a more obvious variation from 4 to 6 servers. Moreover, an 8 node cluster provides 1.2 times better performance than a cluster of 6 nodes. Cassandra has a peer-to-peer design which makes it easy to add new servers to the cluster. This feature is especially beneficial for expeditiously growing data sets.

Fig. 3. Scaling up the cluster size from 2 to 8 improves the performance 1.3 times for writing 32 million records. Reading is improved to be 13 times faster at the same scale.

Fig. 4. We use 6 different record sizes (0.5, 1, 2, 4, 8, 16 million) and show how the two partitioning strategies distribute the data over an 8-node Cassandra cluster. While RandomPartitioner (top) distributes the data evenly, ByteOrderedPartitioner (bottom) places the data on one or two nodes.

Fig. 5. The effect of the partitioning strategy on read/write performance. With ByteOrderedPartitioner, reading 8 million records is 5 times slower, and as the data is doubled to 16 million the performance drops significantly, to almost twenty times inferior.

Fig. 6. How each partitioning strategy responds to increasing concurrent client requests. RandomPartitioner is significantly better as the load is distributed evenly throughout the cluster.

3) Data Partitioning: The distribution of data in the cluster plays a crucial role in a MapReduce implementation, as the paradigm favors local processing. The idea is to make each worker read the input from the local Cassandra instance. The even distribution of data not only contributes to load balancing but also minimizes data movement. Section II-C explains that Cassandra offers two distinct partitioning strategies, each regulating the placement of data differently. Figure 4 compares the distribution of data over an 8-node cluster under these contrasting strategies. Data is added to the cluster using YCSB Workload C in load mode. This graph demonstrates that with the ByteOrderedPartitioner the data is concentrated in a small set of nodes, while the RandomPartitioner spreads it almost evenly. Cassandra recommends RandomPartitioner in almost all cases as it provides a load balanced approach.

MapReduce implementations read input datasets in parallel, process them in parallel, and then write the resulting output set. Hence, the framework interacts with the underlying storage only to read and write. Figure 5, using the YCSB benchmarks, compares read and write times of a dataset partitioned differently on an 8-node Cassandra cluster. We read and write 0.5 to 16 million records and compare times for each. Figure 5 shows that 0.5 million writes are 1.2 times faster under RandomPartitioner and this ratio becomes 1.7 as the data size is increased to 16 million. Reading 0.5 million records is 1.4 times faster with RandomPartitioner, while it is 5 times faster for 8 million and almost 20 times faster for 16 million reads.

In Figure 6, we study how the partitioning strategy affects performance in the case of increasingly parallel operations. This is significant as we are interested in studying Cassandra when used with the highly parallel MapReduce programming model. We use YCSB Workload C with a range of 16 to 128 YCSB client threads to read/write 2 million records from/to an 8-node Cassandra cluster. Figure 6 shows that RandomPartitioner responds to 128 parallel write requests almost 3 times faster than ByteOrderedPartitioner. With the former, performance improves as more threads are added to complete 2 million writes, while the latter becomes almost 2 times slower from 64 to 128 threads. Moreover, with 16 client threads, read requests under RandomPartitioner are 1.3 times faster and the turnaround time keeps improving as more client threads join. The ByteOrderedPartitioner causes the data to be accumulated in a small set of nodes, so an increasing number of parallel read requests cannot be split across all nodes. This leads to slower performance as the data-carrying nodes are overwhelmed due to lack of load balancing. Consequently, under RandomPartitioner 128 threads are able to finish the read intensive job 3.2 times faster than they would under ByteOrderedPartitioner.
This shows that random partitioning ensures load balancing such that operations can be completed by the participating servers collectively.

B. Cassandra and Hadoop

In the following tests we evaluate the combination of Hadoop MapReduce and Cassandra by running Hadoop with 3 different configurations:

• Hadoop-native, which uses HDFS for input and output placement.
• Hadoop-Cassandra-FS, which reads the input from Cassandra and writes the output to the file system shared by the workers.
• Hadoop-Cassandra-Cassandra, which reads input from Cassandra and writes output back to Cassandra.

Figure 7 shows the performance of the 3 different Hadoop setups under 3 different workloads. These workloads are classified into 3 different cases based on the input size to output size ratio. They have been used by Fadika et al. in [18] and are adopted here to examine Hadoop's performance when used with Cassandra.

Figure 7a shows the case where the input data set is significantly larger than the output. Examples of such data can be found in applications searching for data patterns, like "filtering" satellite image data by removing undesired areas to create high value images. The image data can be collected in a Cassandra cluster to provide search and query capabilities. In Figure 7a, we show that while Hadoop-Cassandra-Cassandra is 1.1 times slower than Hadoop-native at processing 4 million input records, it is 1.8 times slower for 64 million. Increasing the input 16 times leads to a slowdown by a factor of 0.7. As we show in the previous graphs, this performance loss can be compensated by adding more nodes to the cluster.
Hadoop-Cassandra-Cassandra and Hadoop-Cassandra-FS display very similar times: as the output is so small, the write location does not have a considerable effect on performance. Figure 8a shows the times spent on the read and write phases for 3 of the data points in Figure 7a. As stated previously, the output dataset is considerably smaller than the input (over ten thousand times smaller). Each of the setups completes output writing in under 1.2 seconds. Note that reads are a big percentage of the job time. While Hadoop reads 4 million records from HDFS 1.3 times faster than it does from Cassandra, it is 2.1 times faster for 64 million. This shows that dramatically increasing the read size does not affect Hadoop-Cassandra performance to a great extent, which is desirable for processing expeditiously growing datasets.

In Figure 7b, we use a workload where the output size is very close to, if not the same as, the input size. An example of such a workload is shuffling a gene sequence dataset to provide various gene combinations. Figure 7b shows that Hadoop-native is 1.2 to 1.4 times faster than Hadoop-Cassandra-FS and 1.9 to 2.9 times faster than Hadoop-Cassandra-Cassandra. Moreover, unlike Figure 8a, the location of writes leads to considerable variance in execution times between Hadoop-Cassandra-FS and Hadoop-Cassandra-Cassandra, as the former is up to 2 times faster with the increasing input size. Figure 8b helps us to understand how much time each setup spends in reads and writes explicitly. The input sizes used are the same as in Figure 7a and Figure 8a, therefore the same read numbers and trends are observed. Writes, however, deviate significantly. Hadoop write performance to HDFS and to the file system are very similar for all the data points shown; writing 4 million records to HDFS is almost 7 times faster than to Cassandra. However, for 64 million records, this ratio drops to 4.2. This can be explained by HDFS's poor performance under heavy writes [18], while Cassandra is optimized for write operations [23].

The workload in Figure 7c has an output data set larger than the input. An example of this is adding new properties and values to each of the input records to generate an output set which is later used as input for data analysis. Fadika et al. [18] call this sort of application "Merge" and illustrate it by linking real world seismic events with regional ground characteristics. Figure 7c displays some interesting results; after 16 million input records Hadoop-Cassandra-FS becomes faster than Hadoop-native. As for Hadoop-Cassandra-Cassandra, Hadoop-native is almost 2 times faster for 4 million records while the ratio drops to 1.2 as the input data grows to 64 million.

In order to understand the details we will, once again, refer to Figure 8. Figure 8c shows the read and write times explicitly for the 4, 16 and 64 million record runs in Figure 7c. Since the input size is similar to the previous figures, the reading trends observed in Figure 8a apply. Writes drive the performance difference. At 4 million input records, Hadoop writes to HDFS 1.3 and 3.6 times faster than it does to the file system and Cassandra respectively. However, with the increasing data size, HDFS writes are only 1.5 times faster than Cassandra writes, while being 3.8 times slower than writing to the file system. Fadika et al. [18] show that Hadoop does not perform well under write intensive operations as HDFS shows poor performance under such demands. Therefore, Hadoop demonstrates better performance when the writes are directed to the file system rather than HDFS. In Section II-C we explain that Cassandra is a write optimized database system, and we show in Figures 2 and 5 that writing to Cassandra is consistently and considerably faster than reading. These optimizations help Hadoop-Cassandra-Cassandra to close the performance gap with Hadoop-native under write intensive arrangements.


Fig. 7. Hadoop with and without Cassandra under 3 different workloads: (a) Read >> Write, (b) Read = Write, (c) Read << Write. In (a), the number of input records is over a thousand times greater than the number of output records. In (b), input and output records are almost equal both in size and count; and in (c), the number of output records is equal to the number of input records but more than ten times greater in size.




Fig. 8. Times spent in the read and write phases for each of the 4, 16 and 64 million record runs shown in Figure 7: (a) Read >> Write, (b) Read = Write, (c) Read << Write. Input size being the same, the read graphs for all workloads are very similar; the changing write size drives the difference in performance ratios. In (a), the output size is very small and each of the writes completes in under 1.2 seconds. (b) is a more write intensive workload and, with growing data sizes, Hadoop writes to HDFS 4.2 times faster than it does to Cassandra and 1.1 times faster than to the file system. In (c), the output is approximately ten times greater than the input. Hadoop completes these write heavy operations to the file system up to almost 4 times faster than it does to HDFS. Under the same load, writing to Cassandra is only 1.4 times slower than HDFS.

In the MapReduce model, computation is moved to the data, which means each worker node processes the data split it owns. In Hadoop, data is split and distributed evenly among the DataNodes and each TaskTracker processes the data from the local DataNode. Using Cassandra with Hadoop leaves the distribution of data to the partitioning strategy set for the input and output column families within Cassandra. As explained previously, RandomPartitioner evenly splits the data among the nodes. This means that with RandomPartitioner each Hadoop worker has local access to almost the same amount of data. On the other hand, the ByteOrderedPartitioner causes the data splits to be collected on a small set of nodes (see Figure 4). Hence, data locality is restricted to a small set of nodes and the remaining workers have to pick up splits remotely. This causes extraneous data movement in the cluster and in turn affects the turnaround time negatively. Figure 9 compares processing 4 to 32 million input records with Hadoop-Cassandra-Cassandra using RandomPartitioner versus ByteOrderedPartitioner. The figure shows that using RandomPartitioner is 1.2 times faster for 4 million records and 3 times faster for 32 million. The performance gap also grows dramatically with the data size, as more data movement is required to complete the MapReduce job. While Figures 5 and 6 display the poor read/write performance of Cassandra under ByteOrderedPartitioner, Figure 2 reveals the cost of losing data locality on a single node. These results prove that for efficient MapReduce performance, it is crucial for the underlying data storage to distribute data evenly and allow for data locality.

Fig. 9. Running Hadoop-Cassandra-Cassandra with RandomPartitioner is 1.2 times faster for 4 million input records, but the gap increases with growing data and ByteOrderedPartitioner is 3 times slower for 32 million records.

Distributed storage systems like HDFS store data on multiple nodes to avoid data loss in case of node failures. The data is replicated according to a user-set replication-factor and each replica is placed on a different node. When using Cassandra with Hadoop, HDFS does not have any control over data replication. Therefore, if the data is not replicated through Cassandra itself, node failures would result in data loss and incomplete jobs. In Figure 10, we show Hadoop-Cassandra-Cassandra performance under different replication factors for 32 and 64 million records. We observe that increasing the replication factor up to 8 only leads to up to 1.1 times slower performance. This is because Hadoop reads the input from Cassandra using range scans, and the overhead of read repair calls to each replica is not incurred for range reads. Therefore Cassandra's replication factor can be increased without any concern for affecting the MapReduce application turnaround time. This is significant as MapReduce relies on commodity machines and node failures are always possible.

Fig. 10. Increasing the Cassandra replication factor (from 1 to 8) does not affect Hadoop performance negatively, shown for 32 and 64 million records.

Figure 11 shows the performance of CPU and memory intensive jobs with Hadoop and Cassandra. Apart from being data intensive, an application's memory and CPU demands are also shown to affect performance in various MapReduce implementations [13]. In most MapReduce applications it is observed that the output is significantly smaller than the input; the applications in Figures 11(a) and (b) share this characteristic. The output being small makes the write time negligible, and consequently Hadoop-Cassandra-Cassandra and Hadoop-Cassandra-FS display similar times. In Figure 11(a), Hadoop-Cassandra-Cassandra under a CPU intensive workload shows times close to Hadoop-native. For each data point shown here, Hadoop-native is only up to 1.1 times faster. This implies that the processing of map/reduce operations takes up most of the total job time and the input/output locations change it only a little.

In Figure 11(b), Hadoop-native performs 1.2 times faster than Hadoop-Cassandra-Cassandra for 4 million input records and this ratio reaches up to 2 as the data increases to 64 million. In a Hadoop-Cassandra-Cassandra setup, each worker node runs a Cassandra server in addition to a TaskTracker and a DataNode. Cassandra servers impose their own memory load on the worker machines because of in-memory structures like Memtables and other cached data. From 4 to 64 million input records we do not observe a considerable difference in the performance ratios. For 4 million input records Hadoop-native is 1.2 times faster, and for 64 million it is 2 times faster than Hadoop-Cassandra-Cassandra. Cassandra's memory footprint is more affected by the number of column families than by the data size in the family. In all tests shown here, the Cassandra servers host only one column family, which allows more memory for caching. Although the caching improves the read times, running an application with large memory demands adds to the memory usage of the worker node and contributes to the overall slower performance.

Fig. 11. Running Hadoop with different setups for (a) processing intensive and (b) memory intensive operations.

IV. RELATED WORK

There are various studies on benchmarking and enhancing the MapReduce paradigm [13]–[18], [20]–[22] and NoSQL technologies separately. Rabl et al. [26], in the context of Application Performance Management (APM), analyze new distributed storage systems such as Cassandra and HBase. They describe tools used for monitoring the response time of services, failure rates, and resource utilization in enterprise systems.

Cattell [9] examines various traditional RDBMSs and new generation NoSQL databases. Some of the features studied include data models, consistency and storage mechanisms, availability, and query support. Padhy et al. [25] explore and compare the data model and architectural decisions made for some popular NoSQL technologies like HBase, Cassandra, MongoDB, and BigTable [10].

MapReduce and NoSQL technologies have been used together extensively in industry for web technologies like social media, online gaming, and e-commerce. DataStax [4] has introduced DataStax Enterprise [5], a "Big Data" platform built on top of Cassandra that includes support for Apache Hadoop and side products like Hive [32] and Pig [24].

Sumbaly et al. [29] present read-only extensions to LinkedIn's Project Voldemort [7], making it more suitable for batch computing with Hadoop. Silberstein et al. [28] provide techniques to improve bulk insertions in distributed databases like PNUTS [11], which in turn helps the performance of batched workloads.

Taylor [31] provides an overview of the use of Hadoop and HBase in bio-informatics: they offer strong computation capabilities supported by MapReduce, and data management that can efficiently collect various data resources in a distributed manner through HBase. Sun and Jin [30] use Hadoop and HBase for processing and storing fast growing RDF data sets; RDF triples are indexed and stored in HBase and MapReduce is used for pattern processing. Ball et al. [8], with the Data Aggregation System (DAS), present a single interface to collect and query relational and non-relational data from different sources. DAS uses MongoDB to provide a caching layer for data services along with storing server logs, analytics data, and key mappings.

V. CONCLUSION

In this paper we show that:

• Cassandra's RandomPartitioner distributes data evenly, improving Hadoop's performance by a factor of 3.
• Increasing the replication-factor on Cassandra does not affect Hadoop turnaround time; leveraging range scans reduces read repair calls on replicas, immunizing Hadoop from replication related performance degradation.
• CPU intensive loads perform better using Hadoop-native, but the difference when using Cassandra is minimal.
• Memory intensive loads perform better, by a factor of two, using Hadoop-native.
• Write-heavy loads favor Hadoop-Cassandra-FS by a large margin, but Hadoop-Cassandra-Cassandra is comparable to Hadoop-native.
• Applications that generate a small set of output data perform best using Hadoop-native.

REFERENCES

[1] Amazon DynamoDB. http://aws.amazon.com/dynamodb/.
[2] Apache Hadoop. http://hadoop.apache.org.
[3] Apache HBase. http://hbase.apache.org.
[4] DataStax. http://www.datastax.com/.
[5] DataStax Enterprise. http://www.datastax.com/products/enterprise.
[6] MongoDB. http://www.mongodb.org.
[7] Project Voldemort. http://www.project-voldemort.com/voldemort/.
[8] G. Ball, V. Kuznetsov, D. Evans, and S. Metson. Data aggregation system: a system for information retrieval on demand over relational and non-relational distributed data sources. In Journal of Physics: Conference Series, volume 331, page 042029. IOP Publishing, 2011.
[9] R. Cattell. Scalable SQL and NoSQL data stores. SIGMOD Rec., 39(4):12–27, May 2011.
[10] F. Chang, J. Dean, S. Ghemawat, W. Hsieh, D. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. Gruber. Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems (TOCS), 26(2):4, 2008.
[11] B. F. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H.-A. Jacobsen, N. Puz, D. Weaver, and R. Yerneni. PNUTS: Yahoo!'s hosted data serving platform. Proceedings of the VLDB Endowment, 1(2):1277–1288, 2008.
[12] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing, SoCC '10, pages 143–154, New York, NY, USA, 2010. ACM.
[13] Z. Fadika, E. Dede, M. Govindaraju, and L. Ramakrishnan. Benchmarking MapReduce implementations for application usage scenarios. In Grid 2011: 12th IEEE/ACM International Conference on Grid Computing, pages 1–8, 2011.
[14] Z. Fadika, E. Dede, M. Govindaraju, and L. Ramakrishnan. MARIANE: MApReduce Implementation Adapted for HPC Environments. In Grid 2011: 12th IEEE/ACM International Conference on Grid Computing, pages 1–8, 2011.
[15] Z. Fadika, E. Dede, J. Hartog, and M. Govindaraju. MARLA: MapReduce for heterogeneous clusters. In Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2012), CCGRID '12, pages 49–56, Washington, DC, USA, 2012. IEEE Computer Society.
[16] Z. Fadika and M. Govindaraju. LEMO-MR: Low overhead and elastic MapReduce implementation optimized for memory and CPU-intensive applications. In Cloud Computing Technology and Science, IEEE International Conference on, pages 1–8, 2010.
[17] Z. Fadika and M. Govindaraju. DELMA: Dynamically ELastic MapReduce framework for CPU-intensive applications. In CCGRID, pages 454–463, 2011.
[18] Z. Fadika, M. Govindaraju, R. Canon, and L. Ramakrishnan. Evaluating Hadoop for data-intensive scientific operations. In Cloud Computing (CLOUD), 2012 IEEE 5th International Conference on, pages 67–74. IEEE, 2012.
[19] E. Hewitt. Cassandra: The Definitive Guide. O'Reilly Media, Incorporated, 2010.
[20] S. Huang, J. Huang, J. Dai, T. Xie, and B. Huang. The HiBench benchmark suite: Characterization of the MapReduce-based data analysis. In ICDE Workshops, pages 41–51. IEEE, 2010.
[21] K. Kim, K. Jeon, H. Han, S.-g. Kim, H. Jung, and H. Y. Yeom. MRBench: A benchmark for MapReduce framework. In Proceedings of the 2008 14th IEEE International Conference on Parallel and Distributed Systems, pages 11–18, 2008.
[22] M. Kontagora and H. Gonzalez-Velez. Benchmarking a MapReduce environment on a full virtualisation platform. In Proceedings of the 2010 International Conference on Complex, Intelligent and Software Intensive Systems, CISIS '10, pages 433–438, 2010.
[23] A. Lakshman and P. Malik. Cassandra: structured storage system on a P2P network. In Proceedings of the 28th ACM Symposium on Principles of Distributed Computing, PODC '09, pages 5–5, New York, NY, USA, 2009. ACM.
[24] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: a not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, pages 1099–1110, New York, NY, USA, 2008. ACM.
[25] R. P. Padhy, M. R. Patra, and S. C. Satapathy. RDBMS to NoSQL: Reviewing some next-generation non-relational databases. International Journal of Advanced Engineering Science and Technologies, 11(1):15–30, 2011.
[26] T. Rabl, S. Gómez-Villamor, M. Sadoghi, V. Muntés-Mulero, H.-A. Jacobsen, and S. Mankovskii. Solving big data challenges for enterprise application performance management. Proc. VLDB Endow., 5(12):1724–1735, Aug. 2012.
[27] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop Distributed File System. In Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, pages 1–10, May 2010.
[28] A. Silberstein, B. F. Cooper, U. Srivastava, E. Vee, R. Yerneni, and R. Ramakrishnan. Efficient bulk insertion into a distributed ordered table. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 765–778. ACM, 2008.
[29] R. Sumbaly, J. Kreps, L. Gao, A. Feinberg, C. Soman, and S. Shah. Serving large-scale batch computed data with Project Voldemort. In Proceedings of the 10th USENIX Conference on File and Storage Technologies, pages 18–18. USENIX Association, 2012.
[30] J. Sun and Q. Jin. Scalable RDF store based on HBase and MapReduce. In Advanced Computer Theory and Engineering (ICACTE), 2010 3rd International Conference on, volume 1, pages V1-633–V1-636, Aug. 2010.
[31] R. Taylor. An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC Bioinformatics, 11(Suppl 12):S1, 2010.
[32] A. Thusoo, J. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu, and R. Murthy. Hive: a petabyte scale data warehouse using Hadoop. In Data Engineering (ICDE), 2010 IEEE 26th International Conference on, pages 996–1005. IEEE, 2010.