International Research Journal of Modernization in Engineering Technology and Science
Volume: 01/Issue: 01/December-2019 www.irjmets.com

AN OVERVIEW OF BIG DATA STORAGE AND MANAGEMENT TOOLS (SQL, NoSQL, NewSQL DATABASES)

1Satish Chandra Reddy Nandipati, 2Chew XinYing*, 3Mohd Adib Omar
1,2,3School of Computer Sciences, 11800, Universiti Sains Malaysia, Pulau Pinang, Malaysia

ABSTRACT
The huge amount of data referred to as ‘big data’, produced by various organizations in structured, semi-structured and unstructured form, is said to have huge potential in a wide range of sectors. Big data is stored and managed by relational (SQL and NewSQL) and non-relational (NoSQL) database management systems for easy data access and analysis. This paper presents a brief literature review to outline the general background of the storage and management tools, analytical platforms, and analysis and visualization tools that are used to handle big data. A brief history, the evolution, some characteristic features, a comparison, the advantages, strengths and weaknesses, and the performance of the three types of databases (SQL, NoSQL and NewSQL) with respect to functional and non-functional features are explained. The paper then turns to the application of these databases in healthcare, education, transportation, etc. Besides their advantages, these databases have disadvantages, which have been overcome by upgrading existing databases and by emerging new databases such as NewSQL; in view of this, the future of databases is briefly illustrated.

KEYWORDS: Big Data Management, SQL Databases, NoSQL Databases, NewSQL Databases, Application of Databases.

I. INTRODUCTION
The history of large data projects is often traced to the years 1937-1943, i.e., to the British effort to interpret Nazi codes during the Second World War.
A large amount of data is produced and shared by different methods by different organizations, such as non-profit sectors, industry, scientific research, public administrations and businesses, along with data related to the earth, the oceans and astronomy. These data take the form of structured data (spreadsheets, relational data), semi-structured data (CSV files, JSON documents, XML files) and unstructured data (doc, pdf, email, audio, video and social media) [1-2]. The basic differences between traditional data and big data are shown in Table 1.

Table 1. Comparison between big data and traditional data [3]

                         | Big Data                                | Traditional
Type of data             | Semi-structured and unstructured        | Structured
Rate of data generation  | Rapid                                   | More time
Sources of data          | Multiple sources                        | Centralized
Volume of data           | Petabytes and zettabytes                | Megabytes and gigabytes
Data storage             | NoSQL, Hadoop Distributed File System   | RDBMS

The characteristics of big data are described as 3Vs [volume, velocity and variety], 4Vs [volume, velocity, variety, and variability], 6Vs [volume, velocity, variety, veracity, variability, and value], 7Vs, 10Vs and 42Vs [2-3]. The data produced by different domains have huge potential for improving the decision-making process in health care, the prediction of natural catastrophes, productivity, energy futures and economics [4]. Despite these advantages, capturing, pre-processing, storage and management, sharing, data exploration, security, and privacy remain important challenges in big data analysis [5]. The first supercomputers were not able to process big data, which made handling such data a great challenge. Advances in computer technology have made it possible to manage huge data without a supercomputer and at lower cost, by storing the data over the network.
The storage and management of big datasets with reliable and available data access is referred to as ‘big data storage and management’. Based on differences in interfaces and functions, data storage and management applications are divided into two parts: the file system (which organizes information on a hard drive and controls the storage and retrieval of data) and the database (an organized collection of data stored in a computer and easily accessible). A software package that lets users and programmers create, update, retrieve and manage data is referred to as a database management system (DBMS). The four different types of DBMS are hierarchical databases, relational databases, object-oriented databases and network databases [6]. A database in which data are identified and accessed in relation to other data is referred to as a ‘relational database’. The collection of programs that maintains the data and allows users to create, update and administer a relational database is referred to as a ‘relational database management system’ (RDBMS). The traditional RDBMS uses Structured Query Language (SQL) as the communication medium for structured data analysis and management, and this approach requires more expensive hardware. In addition, the traditional RDBMS was not able to handle the heterogeneity and huge volume of big data arising from semi-structured and unstructured data. To overcome this obstacle, the research community has put forward different perspectives and proposed distributed file systems and NoSQL databases as good choices for managing semi-structured and unstructured data [2].
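The create, update and retrieve operations that a DBMS offers through SQL can be illustrated with a minimal sketch. It assumes Python's built-in sqlite3 module as a stand-in for a relational DBMS; the table name `patients`, its columns and the sample rows are hypothetical and are not taken from the reviewed literature.

```python
import sqlite3


def demo_crud():
    """Create, update, and retrieve rows in a small relational table."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    cur = conn.cursor()

    # Create: define a structured schema (hypothetical table and columns).
    cur.execute(
        "CREATE TABLE patients (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)"
    )
    cur.executemany(
        "INSERT INTO patients (name, age) VALUES (?, ?)",
        [("Asha", 34), ("Bilal", 52), ("Chen", 41)],
    )

    # Update: change one row, identified by the value in its name column.
    cur.execute("UPDATE patients SET age = ? WHERE name = ?", (35, "Asha"))
    conn.commit()

    # Retrieve: the query states which rows are wanted, not how to find them.
    cur.execute(
        "SELECT name, age FROM patients WHERE age < ? ORDER BY name", (50,)
    )
    rows = cur.fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(demo_crud())
```

The fixed schema declared up front is what makes the data ‘structured’; semi-structured and unstructured data do not fit such a schema without prior transformation.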
Apart from the above-mentioned databases, and to overcome some of their disadvantages, a modern class of RDBMS called NewSQL has emerged. It provides the scalable performance of NoSQL for read-write workloads, i.e., online transaction processing (OLTP), while maintaining the ACID properties of an SQL database system (i.e., scalable performance of NoSQL + ACID properties of RDBMS = NewSQL) [7]. Some of the big data storage and management tools, analytics platforms, and analysis and visualization tools are given in Table 2 [3].

Table 2. Databases for big data storage, analysis and visualization tools [3]

Storage & management tools: Apache Cassandra & HBase, CouchDB, Hive, Hypertable, Infinispan, MongoDB, Neo4j, Riak, Errastore, CockroachDB, NuoDB, Altibase
Analytics platforms: Cloudera, Amazon Web Service, Dreamer, Hadoop, IBM Big Data, KNIME, MaPR, CockroachDB, Microsoft Azure, Open Refine
Analysis tools: Apache Storm, GridGain, HPCC, Pivotal GemFire XD, VoltDB
Visualization tools: ChartBlocks, Datawrapper, Jolicharts, Microsoft Power BI, Plotty, Tableau, Weave, ZohoReports

II. LITERATURE REVIEW
This section covers the history and evolution, highlights and advantages of the three types of databases.

a) History and evolution of SQL databases
The initial development of SQL began with the efforts of Donald D. Chamberlin and Raymond F. Boyce after they met Edgar F. Codd in 1972 at the IBM T.J. Watson Research Center in Yorktown Heights, New York. At that time, E.F. Codd had named his new way of organizing data the “relational data model”. Later, Donald and Raymond decided to make the relational language more accessible to users who were not familiar with either mathematics or computer programming. They found that the problem had to be resolved at two levels (the mathematical notation and the semantic level), and found the solution by replacing symbols with keywords (i.e., replacing the projection symbol with ‘project’ and ∀ with ‘for all’).
Apart from this, the query language had no scope for extension to updates or to administrative tasks such as the creation of new tables and views. After Codd’s symposium, Donald and Raymond spent almost a year designing the language. Later, to work on the System R project, Donald and Raymond moved to the San Jose Research Laboratory and began another new language called Sequel (Structured English Query Language). They hoped that, with a little practice, one could read queries as easily as English prose. Sequel is a declarative language, since a query describes the information wanted rather than the procedure for retrieving it. In 1974, after the presentation of a paper on Sequel at a technical conference in Michigan, Raymond died due to a ruptured brain aneurysm. After the death of Raymond, the Sequel language continued to be part of the System R project at the San Jose Research Laboratory. Later, System R was installed at three IBM customer sites for experimental purposes. Based on the experience collected from early users and implementers, the complete Sequel language was designed and published in 1976. In 1977, the name Sequel was changed to SQL (Structured Query Language) due to a trademark issue [8]. In 1979, prior to the IBM versions (1981 and 1983), the commercial SQL product named Oracle was released by a company called Relational Software, Inc. In 1981, SQL/Data System, the first IBM product based on SQL, was released, followed by DB2 in 1983, which supported many IBM platforms. SQL databases provide a set of properties for database transactions, named ACID (Atomicity, Consistency, Isolation, and Durability).

b) Highlights of SQL Databases
SQL is a standard and openly available query language; queries can be written much as one writes in English, which makes the language easy to learn.
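The atomicity part of the ACID properties mentioned above can be sketched as follows. This is a minimal illustration only, assuming Python's built-in sqlite3 module; the `accounts` table, the account names and the helper functions `make_db`, `transfer` and `balances` are hypothetical names introduced for the example.

```python
import sqlite3


def make_db():
    """Build a small in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 20)]
    )
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as a single atomic transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                # Abort: both UPDATE statements above are undone together.
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


if __name__ == "__main__":
    db = make_db()
    print(transfer(db, "alice", "bob", 30), balances(db))
    print(transfer(db, "alice", "bob", 500), balances(db))
```

A transfer that would overdraw the source account leaves both balances untouched, because the two updates either take effect together or not at all.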
It acts as a medium of communication between the user and the DBMS, giving easy and quick access to the data in the database. With the advancements in SQL, it can now handle pools of data of all sizes. The evolution of SQL technology shows it can run on laptops, PCs, servers,