A Survey of Cloud Database Systems
FEATURE: CLOUD COMPUTING

Ganesh Chandra Deka, Ministry of Labor & Employment, Government of India

IT Pro, March/April 2014. Published by the IEEE Computer Society. 1520-9202/14/$31.00 © 2014 IEEE

This survey of 15 popular cloud databases provides an overview of each system and its storage platform, license type, and the programming language in which its source code is written. It also considers features such as data-handling techniques and billing practices.

The exponential growth of the Internet has resulted in an explosion of data sources, creating storage and data-usability problems. Furthermore, an increase in the number of data types has created challenges in terms of storing and manipulating unstructured data.

These issues have led companies and open source communities to build new tools, known as NoSQL systems or "key-value-store" systems, which aim to offer, on a massive scale, on-demand services and simplified application development and deployment. NoSQL databases are useful for applications that deal with very large volumes of semistructured and unstructured data. The growing popularity of big data will compel many companies to use NoSQL databases [1] instead of traditional databases, so you can expect to see vendors offering simplified rollouts and additional support for NoSQL solutions. [2]

According to nosql-database.org, there are at least 150 NoSQL databases, with various features to meet the requirements of different user groups. It's not possible to discuss all of them here, so I selected 15 popular NoSQL databases that are representative (rather than inclusive), and I review some of their interesting characteristics.

Surveyed Systems

NoSQL databases can be divided into two groups, depending on the elasticity level. The first group contains truly elastic databases, such as MongoDB, which allows the addition of new nodes to a cluster without any observable downtime for the clients. The second group contains rigidly defined BigTable-based NoSQL databases (such as Cassandra and HBase), which have significant downtime when new nodes are added to the cluster.

Constant data availability when nodes are added to or removed from the cluster is made possible by routing mechanisms and algorithms that decide when to move data chunks that are working together. For example, when data must be moved to newly added nodes, the data is served from the original location during the copying process. When the new node has an up-to-date version of the data, the routing processes start to send requests to this node. [1]

Cassandra

The Apache Cassandra database offers good scalability and high availability without compromising performance. Its demonstrated fault tolerance on commodity hardware (cloud infrastructures) and linear scalability make it the ideal platform for mission-critical data. Cassandra's features allow replication across multiple datacenters, offering lower latency and continued data availability during regional outages. Cassandra's ColumnFamily information model offers the convenience of column indexes with the performance of log-structured updates, strong support for materialized views (also known as snapshots), and powerful built-in caching.

Netflix, Twitter, Urban Airship, Reddit, Cisco, OpenX, Digg, CloudKick, and Ooyala are some of the companies that use Cassandra to deal with huge, active, online interactive datasets. The largest known Cassandra cluster has over 300 Tbytes of information in over 400 machines (http://cassandra.apache.org).

BigTable

Google's BigTable maps two arbitrary string values (row key and column key) and a time stamp (creating a 3D mapping) into an associated arbitrary byte array. BigTable can be characterized as a light, distributed, multidimensional sorted map. It was developed to scale to the petabyte range across numerous machines and to make it easy to add machines, automatically taking advantage of those resources without any reconfiguration. [3]

When sizes threaten to grow beyond a specified limit, the tablets are compressed using the BMDiff [4,5] algorithm and the Zippy open source compression algorithm, [6] publicly known as Snappy, [5] which is a less space-optimal variation of the LZ77 algorithm but more efficient in terms of computing time (www.aosabook.org/en/posa/infinispan.html).

To get a specific row stored in BigTable, a new client must connect to all levels of the tree, but the information obtained on the upper levels is cached, meaning that further requests for data stored in tablets that have already been located go directly to the last level of the tree.

HBase

HBase is an open source, distributed, versioned, column-oriented data store modeled after Google's BigTable. Basically, it's a clone of BigTable, providing a real-time, structured database on top of the Hadoop distributed file system. HBase is suitable for applications requiring random, real-time read/write access to big data. HBase's goal is to host very large tables with billions of rows and millions of columns on top of clusters of commodity hardware.

HBase provides linear and modular scalability (the ability of a database to handle a growing amount of data), consistent reads and writes, automatic and configurable sharding (a horizontal partition in a database) of tables, and automatic failover support between RegionServers. It also offers convenient base classes for backing Hadoop MapReduce jobs with HBase tables, an easy-to-use Java API for client access, block cache and Bloom filters for real-time queries, as well as query-predicate pushdown via server-side filters. Finally, it has an extensible JRuby-based shell and includes support for exporting metrics via the Hadoop metrics subsystem to files or via Java Management Extensions.

MongoDB

MongoDB is an open source, schema-free, document-oriented, scalable NoSQL database system. This high-performance, fault-tolerant, persistent system provides a complex query language as well as an implementation of MapReduce.

MongoDB offers

• document-oriented storage: JavaScript Object Notation (JSON)-style documents with dynamic schemas offer simplicity and power;
• full index support: that is, it can index any attribute;
• data availability: it can mirror across LANs and WANs for scalability; and
• autosharding: it scales horizontally without compromising functionality.

It also supports rich, document-based queries; atomic modifiers for contention-free performance; flexible aggregation and data processing; and GridFS (a MongoDB specification for storing files larger than 16 MB), so it can store files of any size without complicating your stack.

Pnuts

The Pnuts system is a massive-scale, hosted, centrally managed database system shared by multiple applications (see Figure 1). It supports Yahoo's data-serving Web applications rather than complex queries (such as offline analysis of Web crawls). [7]

Pnuts provides data management as a service. This significantly reduces application development time, because developers don't have to architect and implement their own scalable, reliable data-management solutions. [4] Consolidating multiple applications into a single service lets users amortize operations costs over multiple applications and apply the same best practices to the data management of many different applications. Moreover, having a shared service means you can keep resources (servers, disks, and so on) in reserve and quickly assign them to applications experiencing a sudden surge in popularity. [6]

Figure 1. The Pnuts data storage architecture. Multiple applications share this massive-scale centrally managed database system. (Diagram labels: Requests; Routers; Tablet controller; Master copy of partition table/tablet mapping; Tablet servers; Tablets.)

[...] reliability. Hypertable will be useful for organizations that must administer rapidly evolving data to support online real-time applications. Modeled after Google's BigTable project, Hypertable is designed to manage information storage and processing on a large cluster of commodity servers, providing resilience to machine and component failures.

CouchDB

Apache CouchDB is a document-oriented database written in Erlang, a robust functional programming language ideal for building concurrent distributed systems. CouchDB can be queried and indexed using JavaScript in a MapReduce fashion. It also offers incremental replication with bidirectional conflict detection and resolution.

It provides a RESTful JSON API that can be accessed from any environment that allows HTTP requests. There are many third-party client libraries, so it can be used from most programming languages. CouchDB's built-in Web administration console speaks directly to the database using HTTP requests issued from the browser.

Voldemort

Voldemort is a fully distributed key-value storage system. Each node is independent, with no central point of failure or coordination. Voldemort is designed for use as a simple storage layer that's fast enough to avoid needing a caching layer on top of it. The software architecture is made of several layers, each of which implements the put, get, and delete operations. Each layer is responsible for a specific function, such as TCP/IP communications, routing, or conflict resolution.

With Voldemort, data is automatically replicated over multiple servers as well as automatically partitioned, so each server contains only a subset of the total data. Server failure is handled transparently, and pluggable serialization