<<

International Journal of Scientific & Engineering Research, Volume 7, Issue 12, December-2016 6 ISSN 2229-5518 Big Data Systems : A Comparative Study Adarsh Kumar Research Scholar Department of Science, Himachal Pradesh University, Shimla-5 E-mail: [email protected] Anita Ganpati Associate Professor Department of , Himachal Pradesh University, Shimla-5 E-mail: [email protected] ------Abstract------We are living in 21st century digital world where large amount of data generated from various sources in bulk and in different format at high rate. In Current time data is more important than information and bulk data help in making correlation and help in decision making which is not possible with small data set. Different formats are not fit in one system. In this paper, a comparative study on different big data database systems in which discuss about classification, architectural properties and functionality. Keywords-BASE, Big Data, MVCC, NoSQL, Open Source, Scalable SQL, 1. Introduction processing that enable enhanced insight, decision In the early days of data generation, almost the data is making, and automation”[13]. generated in relational form typically tables, flat files 2. Overview of Big Data Database Systems and records which stored and analyzed by Large volumes of data are coming at a much faster traditional system such as MySQL, IBM DB2 etc. velocity in varieties of formats which is not handled Digital Revolution start almost two decade ago, by by traditional systems. The term “NoSQL” is the Internet or WWW and HTTP Protocol which emerging to handle this type of data and originally becomes the standardIJSER for information in develop by Carlo Strozzi in 1998.The definition of 1991[3]. Data was generated in diverse forms like “NoSQL” stands for “Not Only SQL” or “Not blog posts, tweets, social network interactions, Relational” [1]. The big data database systems are scientists detailed measurements etc. About 30,000 inspired by BigTable [2]- Persistent record Storage, gigabytes of data are generated every second, and the memcached (in memory indexes), and Amazon’s rate of data creation is accelerating [12]. Dynamo [4] is the idea of eventual consistency . A A Study by International Data Corporation on Digital key feature of NoSQL systems is “shared nothing” Universe in 2011 estimates the volume of data horizontal scaling and replicating and partitioning created and stored grow at 45.2% to 7,910 exabytes data over many servers. The proponents of NoSQL is projected in 2015 on the basis of 130 exabytes in often cite Eric Brewers’s CAP Theorem [8] :A 2005 and 1,227 exabytes in 2010[7] . distributed storage system must choose to sacrifice Big Data define as the amount of data just beyond the either consistency or availability while having traditional database methods and tools capability to tolerance . The term “BASE: Basically store, manage and process efficiently. Big data Available, Soft state, Eventually consistent” coined generally explained according to three V’s namely by Eric brewer [8] for these systems to handle the Velocity, Variety and Volume. According to needs of internet and cloud based models of storage Gratner, “ Bigdata is high-volume, high-velocity which abandon the ACID properties as a tradeoff for and/or high-variety information assets that demand their increased performance and Scalability. The cost-effective, innovative forms of information are grouped according to their strengths and CAP Theorem compromise by Big Data Working

IJSER © 2016 http://www.ijser.org International Journal of Scientific & Engineering Research, Volume 7, Issue 12, December-2016 7 ISSN 2229-5518

Group [14] are relational, document –oriented, key- (objects) per database, nested documents or lists and value, BigTable-inspired, Dynamo-inspired, graph do not provide ACID transactional properties. and NewSQL. The Big data database systems are CouchDB Key-value Database Systems CouchDB [17] commonly refer as Apache CouchBD The data model used by key-value systems is similar is an open source database with an architecture that to memcached distributed in memory cache[5] in completely embraces the web written in Erlang and which a single key value for all the data. These license under Apache2.0. CouchDB store documents systems generally provide a persistence mechanism, with metadata and maintain its own schema. It , versioning, locking, transactions, sorting, provides a RESTful HTTP for reading and updating and client interface provides inserts, deletes and documents with lockless and optimistic. It distributes index lookup. data with incremental replication and supports Project Voldemort automatic conflict detection. It is highly available and Project Voldemort[15] is an open source, advanced partition tolerance with eventually consistency. It key- value database system written in Java with provides scalability through asynchronous replication substantial contributions from LinkedIn and license and not guarantees consistency through under Apache 2.0. It provides MVCC for updates implementation of MVCC on individual documents. replicas asynchronously and guarantees an up-to-date MongoDB for read a majority of replicas. It supports MongoDB[18 ] is a open source document database optimistic locking for consistent multi record update written in C, C++and Java Script which license under by Vector clocks provide an ordering on version and GPL v3.0 open source and Apache 2.0 . Documents automatic sharding of data by consistent hashing. similar to JSON object as a data structure composed Nodes can be added or removed from a database of field and value pairs. It provides high performance cluster, and the system adapts automatically and and data persistence by support of embedded data automatically detects and recovers failed nodes. It models and faster queries with indexes support and store data in RAM and also permits plugging in a atomic operation on fields. It provides high storage engine (Berkeley DB and Random Access availability by asynchronous replication facility, File storage). automatic failover and data redundancy. It provides Redis scalability by sharding. It stores the data on disk and Redis[16] is an open source BSD licensed), in- supports multiple storage engines WiredTiger and IJSERMMAPv1. memory data structure system, used as database, cache and message broker written in ANSI C. It Extensible Record Database Systems supports data structures such These systems are motivated by Google’s BigTable as strings, hashes, lists, sets, sorted sets with range [2] and Amazon’s Dynamo [4]. The basic model is queries, bitmaps, hyperloglogs and geo-spatial rows and columns in which rows are split across indexes with radius queries. A Redis server is nodes by sharding and columns by columns grouping accessed by a wire protocol implemented in various and their scalability model is spilting columns over client libraries and the distributed hashing over nodes. servers by client side. The servers store data in RAM Hbase and copied to disk for backup. System shutdown may Hbase[19 ] is a open source distributed hadoop be needed to add more nodes. Redis implements database system written in Java which License under insert, delete and lookup operations. It does atomic Apache 2.0. it provide Google’s BigTable like updates by locking and asynchronous replication, on capabilities on top of Hadoop and HDFS. It hosts disk persistence, high availability via Redis Sentinel, very large tables, supports operations with row automatic partitioning by Redis cluster. level locking and transactions and updates into Document Database Systems memory and periodically writes to files on disk by These systems store documents in form of articles, optimistic concurrency control with other updates. Microsoft Word files, etc. and a document is any kind The partitioning and distribution are automatic and of “pointer less object”. Most of systems support configurable sharding of tables with multiple master secondary indexes and multiple types of documents supports. It provides automatic fail over support, easy

IJSER © 2016 http://www.ijser.org International Journal of Scientific & Engineering Research, Volume 7, Issue 12, December-2016 8 ISSN 2229-5518 to use Java API for client’s access and linear and 3. Literature Review modular scalability. Rick Cattell [1] has done a comprehensive survey on Cassandra Scalable SQL and NoSQL databases. In this paper, Cassandra [20] is a open source He classify these system on their data model, system written in Java, License under Apache 2.0, consistency control, data storage, durability, supports by DataStax and originally open source by availability, query support, and other dimensions into Facebook in 2008. It handles very large tables on key-value, document, extended record and relational. commodity hardware and cloud infrastructure with Tudorica, Bogdan George, and Cristian Bucur[10] no single point failure. It has groups, updates has done a comparison between several NoSQL are cached in memory and flushes to disk after databases with comments and notes. This paper is writing log on disk, disk representation is compacted trying to comment on the various NoSQL systems periodically. It provides no locking mechanism and and to make a comparison based on qualitative and replication asynchronously by MVCC, automatic quatitative point of view between Cassandra, Hbase failure detection and recovery. It uses phi accrual and MySQL. The qualitative view or criteria algorithm to detect a node failure and to determine compare features are persistence, replication, high cluster membership by a gossip style algorithms. It availability, transactions, rack-locality awareness, uses hash indexes and support versioning and conflict implementation language, influences / sponsors and resolution. license type . The quantitative evaluation criteria or Scalable Relational Systems view based on two sets, one related to size(number of Relational DBMSs have a pre-defined schema, a records/rows/document store, number of node in an query interface and ACID properties and not achieve installation) and other related to performance(Read scalability. By introduction of scalable relational and write latency in both write and read intensive system which provide scalability and avaliabilty and environment). The conclusion shows the SQL and the good per-node performance. NoSQL databases are having some shared features MySQL Cluster are not similar in a given instances. These systems MySQL Cluster [21] is a distributed shared nothing cannot be used interchangeable for solving any type system written in C and C++ has a part of the of problem, but choose between the two types of MySQL and license under GNU General Public databases for a given instance. License and a standard commercial License from Jing han et al. [9] has done a survey on NoSQL Oracle. It integrates IJSERMySQL server with in- memory database. This paper describes the background, basic cluster storage engine called NDB (Network characteristics, data model of NoSQL, classifies Database). It refers to a part of the setup that is NoSQL databases according to the CAP theorem, specific to storage engine. It provides real-time describe the mainstream NoSQL databases and on the response time, high throughput, auto sharding for basis of properties to help enterprises to choose write scalability by partitions tables across nodes and NoSQL. Based on the above knowledge of the 99.999% availability. It provides Synchronous mainstream NoSQL databases companies decide Replication, automatic failover, Self healing and whether to use NoSQL. In their study observed that Geographical Replication with no single point failure. companies need to consider the following options VoltDB when deciding which properties NoSQL are Data VoltDB [22] is a distributed shared nothing system Model, CAP Support, Multi Data Center Support, written in Java and C++, license under GNU affero Capacity, Performance, Query API, Reliability, Data Public License v3 and voltDB proprietary License. It Persistence, Rebalancing and Business Support. is a in-memory NewSQL System in which tables are Santhosh Kumar Gajendran [6] has done a survey on partitioned over multiple nodes and replicated database. The goal of this survey is to synchronously. It provides good per-node understand the current needs that have led to the performance and scalability and availability, and evolution of NoSQL databases, why relational Query processing is single threaded for each shard database systems were not able to meet these with transaction consistency (ACID) of traditional requirements and a brief discussion of some of the relational systems. successful NoSQL data stores. In their study,

IJSER © 2016 http://www.ijser.org International Journal of Scientific & Engineering Research, Volume 7, Issue 12, December-2016 9 ISSN 2229-5518 common concepts underlying these databases and 1: Comparison of Big Data Database how they compromise on ACID properties to achieve Systems on Basic Parameters high scalability and availability. The NoSQL databases Dynamo, voldemort, CouchDB, DATA DATA PARAMETERS MongoDB, BigTable, HBase and Cassandra based BASE BASE Impleme Influe Persist on License type, concurrency control, data storage MODE SYSTE Licens ntation nces/ ence and replication are surveyed . Each database and its L M e type Languag Spons implementation has strengths at addressing specific enterprise or cloud concerns such as being easy to e ors operate, providing a flexible data model, high Key- Project Apach Java Linke Yes availability, high scalability and fault tolerance. value voldem e 2.0 dIn Manoj V [11] has done a comparative study on ort NoSQL databases are Cassandra, MongoDB and

Hbase on basis of architecture and working. The parameter of study are classification, architecture, Redis BSD C Eredis Yes availability, data model, partitioning and evaluation open , of Cassandra as industry use case. In their study that Source redox, MongoDB fits for use cases with document, carmi document search and aggregation functions are ne mandate. HBase suits the scenarios in which hadoop map reduce is useful for bulk read and load Docum Couch Apach Erlang Apach Yes operations and offers optimized read performance ent DB e 2.0 e with hadoop platform. Cassandra can be used for applications requiring faster writes and high Mongo GNU C++, C, 10gen, Yes availability. DB Gener Java linked 4. Objective and Scope of the Study al Script In, In this paper, a comparative study on different Pubic FourS database systems specially NoSQL and Scalable SQL Licenc qure in which discuss about classification, architectural e v3.0 IJSERopen properties and functionality. The parameters of study are License type, Implementation Language, source, Influences/ sponsors, persistence, Concurrency Apach control, Data storage, Replication and Transaction e 2.0 Mechanism. The Graph and object oriented database Extend HBase Apach Java BigTa Yes systems are beyond our scope. ed e2.0 ble Record 5. Research Methodology Cassan Apach Java BigTa Yes The research methodology followed a theoretical dra e2.0 ble, approach that comprises literature survey, articles, Amaz books, research papers and internet. on's 6. Analysis and results Dyna The study emphasis on the comparative study of mo, NoSQL and Scalable SQL systems are Project facebo Voldemort, Redis, CouchDB, MongoDB, HBase, ok Cassandra, MySQL Cluster and VoltDB. The comparative study or criteria compare parameters are License type, Implementation Language, Influences/ sponsors, persistence, Concurrency control, Data storage, Replication and Transaction Mechanism.

IJSER © 2016 http://www.ijser.org International Journal of Scientific & Engineering Research, Volume 7, Issue 12, December-2016 10 ISSN 2229-5518

Scalabl MySQ GNU C/C++ Oracle Yes Key- Project MVCC RA Asynchr Not e L Gener value voldem M or onous MySQ Cluster al ort Plug L/ Public in a Licens stora Relatio e v3.0, ge nal a engi standa ne rd Redis Locks RA Asynchr Not comm M onous ercial

Licens e from CouchD MVCC Disk Asynchr Not Oracle Docum B onous ent VoltDB GNU Java SQL / Yes Mongo Locks Disk Asynchr Not Gener Hstore DB onous al LMA Extende HBase Locks Had Asynchr Local Public X d oop onous Licens Record e v3.0

Cassand MVCC Disk Asynchr Local The Apache 2.0 license type Systems are Project ra onous Voldemort, CouchDB, MongoDB, HBase, Cassandra, BSD open source is Redis, GNU General Public Scalabl MySQL ACID Disk Synchro Yes License v3.0 are MongoDB, MySQL Cluster and e Cluster nous MySQL VoltDB, and a standard commercial License for VoltDB ACID, RA Syschro Yes, / Oracle is MySQL Cluster.The systems are no M nous threadi Relatio implemented in different languages, in Java ng on implemented systemsIJSER are Project Voldemort, HBase, nal shards Cassandra and VoltDB, C and C++ implemented systems are Redis, MongoDB and MySQL Cluster, The concurrency control MVCC is provide by Project Erlang implemented is CouchDB and Java Script Voldemort, CouchDB and Cassandra, Locks is Implemented System is MongoDB. Persistence in provide by Redis, MongoDB and HBase, and ACID term of durability, failure detection and recovery is is provide by MySQL Cluster and VoltDB. The data provided by all systems. storage RAM is use by Project Voltemort, Redis and VoltDB, disk is use by CouchDB, MongoDB, Cassandra and MySQL Cluster. Asynchronous Table 2: Comparison of Big Data Database replication is provide by all NoSQL Systems are Systems on Functional Parameters Project Voltemort, Redis,CouchDB, MongoDB, HBase and Cassandra, Synchronous Replication is DATA DATA PARAMETERS provided by MySQL Cluster and VoltDB. BASE BASE Transaction mechanism are supported by HBase, MODE SYSTE Concur Data Replicati Transa Cassandra, MySQL Cluster and VoltDB, and not in L M rency Stor on ction Project Voltemort, Redis, CouchDB and MongoDB. control age Mecha nism 7. Conclusion and Future Scope

IJSER © 2016 http://www.ijser.org International Journal of Scientific & Engineering Research, Volume 7, Issue 12, December-2016 11 ISSN 2229-5518

In this Paper, a comparative study of various NoSQL 10 Tudorica, Bogdan George, and Cristian Bucur. "A and Scalable Relational Systems based on their comparison between several NoSQL databases with classification and architectural properties and comments and notes." Roedunet International features. Each system has its own specific use and no one fit for all requirements. Each of them has been Conference (RoEduNet), 2011 10th. IEEE, motivated by varying requirements which has led to 2011.Lohr, Steve. The age of big data.New York their development mostly from the industry. The Times11 (2012). Project Voldemort, Redis, HBase, Cassandra MySQL Cluster and VoltDB is suited well for 11.Manoj, V. "Comparative study of nosql document, structured data. CouchDB and MongoDB is well column store databases and evaluation of suited for unstructured data. In Future, there will be cassandra." International Journal of Database scope of doing quantitative analysis based on throughput, number of node it handle by systems and Management Systems 6.4 (2014): 11. on various algorithms used for Concurrency control, 12. Marz, Nathan, and James Warren.Big Data: failure detection and transaction mechanism. Principles and best practices of scalable real time References data systems. Manning Publications Co., 2015. 1. Cattell, Rick. "Scalable SQL and NoSQL data stores." ACM SIGMOD Record39.4 (2011): 12-27. 13.“Big Data” http://www.gartner.com/it- 2. Chang, Fay, et al. "Bigtable: A distributed storage glossary/big-data/ system for structured data." ACM Transactions on 14.“BigDataTaxonomy” https://downloads.cloudsec Computer Systems (TOCS) 26.2 (2008): 4. urityalliance.org/initiatives/bdwg/Big_Data_Taxono 3. Dean, Jared. Big data, data mining, and machine my.pdf learning: value creation for business leaders and 15. “ Project Voldemort-a distributed database” practitioners. John Wiley & Sons, 2014. http://www.project-voldemort.com/voldemort/ 4. DeCandia, Giuseppe, et al. "Dynamo: amazon's 16.“Redis” http://redis.io/ highly available key-value store." ACM SIGOPS 17. “Apache CouchDB 1.6 Operating Systems Review. Vol. 41. No. 6. ACM, Documentation” http://docs.couchdb.org/en/1.6.1/ 2007. 18. “The MongoDB 3.2 5. Fitzpatrick, Brad. "Distributed caching with IJSERManual ” https://docs.mongodb.org/manual/ memcached." Linux journal2004.124 (2004): 5. 19. “Apache HBase ™ Reference 6. Gajendran, Santhosh Kumar. "A survey on nosql Guide” https://hbase.apache.org/book.html databases." University of Illinois (2012). 20. “Apache Cassandra 7.Gantz, John, and David Reinsel. "Extracting value 2.2” http://docs.datastax.com/en/cassandra/2.2/cassan from chaos." DC iview1142 (2011): 1-12. dra/cassandraAbout.html 8. Gilbert, Seth, and Nancy Lynch. "Brewer's 21. “ MySQL conjecture and the feasibility of consistent, available, Cluster” https://dev.mysql.com/doc/index- partition-tolerant web services." ACM SIGACT cluster.html News33.2 (2002): 51-59. 22.“VOLTDB 9. Han, Jing, et al. "Survey on NoSQL Documentation” https://docs.voltdb.com/ database." Pervasive computing and applications (ICPCA), 2011 6th international conference on.

IEEE, 2011.

IJSER © 2016 http://www.ijser.org