A Comparative Study

International Journal of Scientific & Engineering Research, Volume 7, Issue 12, December-2016 6 ISSN 2229-5518 Big Data Database Systems : A Comparative Study Adarsh Kumar Research Scholar Department of Computer Science, Himachal Pradesh University, Shimla-5 E-mail: [email protected] Anita Ganpati Associate Professor Department of Computer Science, Himachal Pradesh University, Shimla-5 E-mail: [email protected] ---------------------------------------------------------Abstract----------------------------------------------------------- We are living in 21st century digital world where large amount of data generated from various sources in bulk and in different format at high rate. In Current time data is more important than information and bulk data help in making correlation and help in decision making which is not possible with small data set. Different formats are not fit in one system. In this paper, a comparative study on different big data database systems in which discuss about classification, architectural properties and functionality. Keywords-BASE, Big Data, MVCC, NoSQL, Open Source, Scalable SQL, Scalability 1. Introduction processing that enable enhanced insight, decision In the early days of data generation, almost the data is making, and process automation”[13]. generated in relational form typically tables, flat files 2. Overview of Big Data Database Systems and records which stored and analyzed by Large volumes of data are coming at a much faster traditional system such as MySQL, IBM DB2 etc. velocity in varieties of formats which is not handled Digital Revolution start almost two decade ago, by by traditional systems. The term “NoSQL” is the Internet or WWW and HTTP Protocol which emerging to handle this type of data and originally becomes the standardIJSER for information sharing in develop by Carlo Strozzi in 1998.The definition of 1991[3]. Data was generated in diverse forms like “NoSQL” stands for “Not Only SQL” or “Not blog posts, tweets, social network interactions, Relational” [1]. The big data database systems are scientists detailed measurements etc. About 30,000 inspired by BigTable [2]- Persistent record Storage, gigabytes of data are generated every second, and the memcached (in memory indexes), and Amazon’s rate of data creation is accelerating [12]. Dynamo [4] is the idea of eventual consistency . A A Study by International Data Corporation on Digital key feature of NoSQL systems is “shared nothing” Universe in 2011 estimates the volume of data horizontal scaling and replicating and partitioning created and stored grow at 45.2% to 7,910 exabytes data over many servers. The proponents of NoSQL is projected in 2015 on the basis of 130 exabytes in often cite Eric Brewers’s CAP Theorem [8] :A 2005 and 1,227 exabytes in 2010[7] . distributed storage system must choose to sacrifice Big Data define as the amount of data just beyond the either consistency or availability while having traditional database methods and tools capability to partition tolerance . The term “BASE: Basically store, manage and process efficiently. Big data Available, Soft state, Eventually consistent” coined generally explained according to three V’s namely by Eric brewer [8] for these systems to handle the Velocity, Variety and Volume. According to needs of internet and cloud based models of storage Gratner, “ Bigdata is high-volume, high-velocity which abandon the ACID properties as a tradeoff for and/or high-variety information assets that demand their increased performance and Scalability. The cost-effective, innovative forms of information databases are grouped according to their strengths and CAP Theorem compromise by Big Data Working IJSER © 2016 http://www.ijser.org International Journal of Scientific & Engineering Research, Volume 7, Issue 12, December-2016 7 ISSN 2229-5518 Group [14] are relational, document –oriented, key- (objects) per database, nested documents or lists and value, BigTable-inspired, Dynamo-inspired, graph do not provide ACID transactional properties. and NewSQL. The Big data database systems are CouchDB Key-value Database Systems CouchDB [17] commonly refer as Apache CouchBD The data model used by key-value systems is similar is an open source database with an architecture that to memcached distributed in memory cache[5] in completely embraces the web written in Erlang and which a single key value for all the data. These license under Apache2.0. CouchDB store documents systems generally provide a persistence mechanism, with metadata and maintain its own schema. It replication, versioning, locking, transactions, sorting, provides a RESTful HTTP for reading and updating and client interface provides inserts, deletes and documents with lockless and optimistic. It distributes index lookup. data with incremental replication and supports Project Voldemort automatic conflict detection. It is highly available and Project Voldemort[15] is an open source, advanced partition tolerance with eventually consistency. It key- value database system written in Java with provides scalability through asynchronous replication substantial contributions from LinkedIn and license and not guarantees consistency through under Apache 2.0. It provides MVCC for updates implementation of MVCC on individual documents. replicas asynchronously and guarantees an up-to-date MongoDB view for read a majority of replicas. It supports MongoDB[18 ] is a open source document database optimistic locking for consistent multi record update written in C, C++and Java Script which license under by Vector clocks provide an ordering on version and GPL v3.0 open source and Apache 2.0 . Documents automatic sharding of data by consistent hashing. similar to JSON object as a data structure composed Nodes can be added or removed from a database of field and value pairs. It provides high performance cluster, and the system adapts automatically and and data persistence by support of embedded data automatically detects and recovers failed nodes. It models and faster queries with indexes support and store data in RAM and also permits plugging in a atomic operation on fields. It provides high storage engine (Berkeley DB and Random Access availability by asynchronous replication facility, File storage). automatic failover and data redundancy. It provides Redis scalability by sharding. It stores the data on disk and Redis[16] is an open source BSD licensed), in- supports multiple storage engines WiredTiger and IJSERMMAPv1. memory data structure system, used as database, cache and message broker written in ANSI C. It Extensible Record Database Systems supports data structures such These systems are motivated by Google’s BigTable as strings, hashes, lists, sets, sorted sets with range [2] and Amazon’s Dynamo [4]. The basic model is queries, bitmaps, hyperloglogs and geo-spatial rows and columns in which rows are split across indexes with radius queries. A Redis server is nodes by sharding and columns by columns grouping accessed by a wire protocol implemented in various and their scalability model is spilting columns over client libraries and the distributed hashing over nodes. servers by client side. The servers store data in RAM Hbase and copied to disk for backup. System shutdown may Hbase[19 ] is a open source distributed hadoop be needed to add more nodes. Redis implements database system written in Java which License under insert, delete and lookup operations. It does atomic Apache 2.0. it provide Google’s BigTable like updates by locking and asynchronous replication, on capabilities on top of Hadoop and HDFS. It hosts disk persistence, high availability via Redis Sentinel, very large tables, supports row operations with row automatic partitioning by Redis cluster. level locking and transactions and updates into Document Database Systems memory and periodically writes to files on disk by These systems store documents in form of articles, optimistic concurrency control with other updates. Microsoft Word files, etc. and a document is any kind The partitioning and distribution are automatic and of “pointer less object”. Most of systems support configurable sharding of tables with multiple master secondary indexes and multiple types of documents supports. It provides automatic fail over support, easy IJSER © 2016 http://www.ijser.org International Journal of Scientific & Engineering Research, Volume 7, Issue 12, December-2016 8 ISSN 2229-5518 to use Java API for client’s access and linear and 3. Literature Review modular scalability. Rick Cattell [1] has done a comprehensive survey on Cassandra Scalable SQL and NoSQL databases. In this paper, Cassandra [20] is a open source distributed database He classify these system on their data model, system written in Java, License under Apache 2.0, consistency control, data storage, durability, supports by DataStax and originally open source by availability, query support, and other dimensions into Facebook in 2008. It handles very large tables on key-value, document, extended record and relational. commodity hardware and cloud infrastructure with Tudorica, Bogdan George, and Cristian Bucur[10] no single point failure. It has column groups, updates has done a comparison between several NoSQL are cached in memory and flushes to disk after databases with comments and notes. This paper is writing log on disk, disk representation is compacted trying to comment on the various NoSQL systems periodically. It provides no locking mechanism and and to make a comparison based on qualitative and replication asynchronously

Load more