NoSQL Database Comparison: Bigtable, Cassandra and MongoDB
Running head: NOSQL DATABASE COMPARISON: BIGTABLE, CASSANDRA AND MONGODB

NoSQL Database Comparison: Bigtable, Cassandra and MongoDB
CJ Campbell
Brigham Young University
October 16, 2015

Contents
  Introduction
  The Systems
    Google Bigtable: History; Data model & operations; Physical Storage; ACID properties; Scalability
    Apache Cassandra: History; Data model & operations; Physical Storage; ACID properties; Scalability
    MongoDB: History; Data model & operations; Physical Storage; ACID properties; Scalability
  Differences
  Conclusion
  References

Introduction

As distributed systems are adopted and grown to scale, the need for database solutions that meet an application's exact needs has become increasingly important. In the early days of computing, databases were almost entirely relational. Today, a new breed of database has emerged: the NoSQL database. NoSQL databases are a common element in the design of most distributed software platforms, and each is suited to a slightly different purpose than its peers. This paper discusses the features, similarities, and differences of three NoSQL databases: Google Bigtable, Apache Cassandra, and MongoDB.

The Systems

In this section, each of the three NoSQL databases is analyzed in depth, starting with Google Bigtable, then Apache Cassandra, and finally MongoDB. The analysis covers each system's history, data model, supported operations, physical storage schema, ACID properties, and scalability.

Google Bigtable

History. Bigtable was designed within Google to meet its internal data-processing needs at scale. Development began in 2004 as part of an effort to handle large amounts of data across applications such as web indexing, Google Earth, and Google Finance (Google, Inc., 2006). It first went into production use in April 2005.
In May 2015, Google released a public version of Bigtable called Cloud Bigtable as part of the Google Cloud Platform (O'Connor, 2015).

Data model & operations. Bigtable stores semi-structured data. At a high level, it is a key-value store. Diving deeper, the value is a set of columns that can be unique for each row, as in a jagged array. Columns are grouped into "column families," which allows iteration across similar data and improves backend efficiency. Cells can contain multiple versions of the same data, indexed by timestamp, with a configurable limit to keep only recent entries. Data is sorted in lexicographic order by row key, which lets users exploit key selection for good data locality, thereby increasing performance.

Physical Storage. Google's 2006 Bigtable paper describes its structure as "a sparse, distributed, persistent multidimensional sorted map." Data is stored on the Google File System (GFS) in the SSTable file format, which is optimized for reads and writes of similarly-keyed data.

ACID properties. Reads and writes are atomic on a per-row basis, regardless of how many columns the row contains. Atomic actions are not available across multiple rows.

Scalability. The introduction to Google's Bigtable paper claims the ability to "reliably scale to petabytes of data and thousands of machines." Bigtable can be configured to optimize for different needs, such as availability or low latency; one example of such configuration is the ability to serve reads from memory instead of hard disk.

Apache Cassandra

History. The Cassandra project was created around 2008 by Avinash Lakshman and Prashant Malik. It is named after a mythological Greek prophet; some reports claim the name was chosen in opposition to Oracle's database (The meaning behind the name of Apache Cassandra, 2013). It was created to power the inbox-search feature for Facebook.
The Cassandra project was open-sourced on Google Code in July 2008, became an Apache Incubator project in 2009, and graduated to a top-level Apache project in 2010. While the open community continued to embrace Cassandra, Facebook tapered its own usage: in 2010 it released a new version of Messages built on HBase instead of Cassandra, having found Cassandra's model "to be a difficult pattern to reconcile for our new Messages infrastructure" (Muthukkaruppan, 2010). Despite being left behind by its parent project, Cassandra is ranked the most popular wide column store, and the eighth-most popular database overall, as of October 2015 (DB-Engines Ranking, 2015).

Data model & operations. Cassandra's data model has evolved over time. It began with column families and super column families, and only three data operations were initially available: insert, get, and delete. The original design is completely unrecognizable in the Cassandra of 2015 (Ellis, n.d.). Today's model looks more like a collection of denormalized, non-relational tables, with a query language similar to that of relational databases. This design increases speed because there is no need to join across tables, though it comes at the price of data duplication. Tables can be updated live without locking or downtime (Datastax, 2015).

Physical Storage. Data is distributed across a cluster using a consistent-hashing ring. Cassandra uses virtual nodes to rearrange data for load balancing, so adding or removing a node affects only its immediate neighbors (Ellis, n.d.). The nodes to which data is initially assigned are called "coordinators." They can be configured to replicate N copies of the data across the cluster, with additional configuration for locality awareness. The ring allows the cluster to operate without any single point of failure. Data is stored on the local filesystem in a log-structured layout that is optimized for fast writes, at the cost of slower reads.
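The placement scheme just described can be illustrated with a minimal consistent-hashing sketch. This is a simplified model with hypothetical node names, not Cassandra's actual implementation: real clusters assign many virtual-node tokens per machine via a partitioner such as Murmur3.

```python
import bisect
import hashlib

def token(key: str) -> int:
    # Hash a key onto the ring (stand-in for Cassandra's partitioner).
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, replication_factor=3):
        self.rf = replication_factor
        # Each node owns the arc of the ring ending at its token.
        self.tokens = sorted((token(n), n) for n in nodes)

    def replicas(self, row_key: str):
        # The first node clockwise from the key's token coordinates the
        # write; the next rf-1 distinct nodes hold the other replicas.
        i = bisect.bisect(self.tokens, (token(row_key),))
        return [self.tokens[(i + j) % len(self.tokens)][1]
                for j in range(self.rf)]

ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
owners = ring.replicas("user:42")
print(owners)  # coordinator first, then its two clockwise neighbors
```

Because each key's owners are a contiguous run of ring positions, adding or removing one node shifts ownership only for keys in the adjacent arcs, which is the locality property the text attributes to virtual nodes.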
Changes are first written to a local commit log and then into an in-memory cache. When a dynamically calculated threshold is reached, the in-memory data is flushed to hard disk.

ACID properties. Communication within a cluster is based on the gossip protocol, an eventual-consistency model. Like most distributed database systems, Cassandra is built for high availability and partition tolerance with eventual consistency. A useful feature, however, is that this consistency is configurable to fit specific use cases. Operations on a single node are ACID compliant, though operations across the cluster are not. For transactional writes, Cassandra uses a modified Paxos consensus protocol; this of course costs performance, and should be reserved for transactionally sensitive operations (Ellis, Lightweight transactions in Cassandra 2.0, 2013).

Scalability. Cassandra's distributed structure makes it a viable option for globally replicated data. In 2011, Netflix performed a benchmarking test and reported that Cassandra is linearly scalable (Cockroft & Sheahan, 2011). The University of Toronto performed a similar test in 2012 with similar results, noting that "this comes at the price of high write and read latencies" (Rabl, et al.). Cassandra's feature-richness carries its own cost, however: although the database offers powerful tooling and configurability, the learning curve grows with the complexity of the system. The user base that can support Cassandra is therefore smaller than that of other databases, and the availability of maintenance staff is ever-important.

MongoDB

History. MongoDB was created by 10gen in 2007 as the data layer for its platform as a service, Babble. The database takes its name from the word "humongous" (History of MongoDB, n.d.). The market did not take to Babble, and in 2009 the project was open-sourced.
By August 2013 the project had become the central focus of 10gen's development, so much so that the company changed its name to MongoDB, Inc. (Harris, 2013). Since then, MongoDB has become the world's most popular NoSQL database (DB-Engines Ranking, 2015).

Data model & operations. MongoDB was developed in a JavaScript-oriented environment, and it shows in the data structure. It is classified as a document store, or document-oriented database. Document stores provide the same lookup functionality as a key-value store, but also provide visibility into the stored documents (MongoDB, 2015). Data is stored in BSON, or Binary JSON, an optimized binary encoding of JSON; in everyday usage it looks almost exactly like JSON to developers. Because JSON is so widespread, MongoDB's learning curve is small compared to that of other databases. Document visibility opens the API to querying, filtering, and sorting on values within a document, to modifying individual document values, and to MapReduce and aggregation functions. Documents are partitioned into collections in MongoDB as rows are partitioned into tables in a relational store. Documents in a collection should contain similar data and share the same structure, though this is not enforced.

Physical Storage. The size of a BSON object is limited to 16 MB. Just as documents can be queried by inner value, they can also be indexed by inner value. The administrator can define a sharding key to increase data locality, which optimizes aggregation functions (Suter, 2012).

ACID properties. Operations are atomic at the document level. Data that must be updated atomically must therefore reside within a single document; atomic transactions are not possible across multiple documents.

Scalability. MongoDB automatically manages horizontal scaling across shards. As a node is
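To make the document model described above concrete, here is a minimal in-memory sketch of how a document store exposes the inside of its values for querying, unlike an opaque key-value store. The collection and field names are hypothetical, and the `find` helper is an illustration only; a real deployment would query through a driver such as PyMongo against a running server.

```python
# A collection holds schema-flexible, JSON/BSON-like documents; documents
# should share a shape, but the store does not enforce one.
users = [
    {"_id": 1, "name": "Ada",   "langs": ["python", "js"], "logins": 42},
    {"_id": 2, "name": "Grace", "langs": ["cobol"],        "logins": 7},
    {"_id": 3, "name": "Linus", "langs": ["c"]},  # missing "logins": allowed
]

def find(collection, query):
    # Tiny analogue of a document store's find(): match documents by the
    # values of inner fields, which a pure key-value store cannot see.
    return [doc for doc in collection
            if all(doc.get(field) == want for field, want in query.items())]

print(find(users, {"name": "Grace"}))   # matched by inner field, not by key
print(len(find(users, {"logins": 42}))) # 1
```

The same inner-field visibility is what lets MongoDB build secondary indexes on document values and choose a sharding key from inside the document, as described in the Physical Storage subsection above.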