Data Management in Cloud Environments: Nosql and Newsql Data Stores Katarina Grolinger1, Wilson a Higashino1,2*, Abhinav Tiwari1 and Miriam AM Capretz1
Total Page:16
File Type:pdf, Size:1020Kb
Grolinger et al. Journal of Cloud Computing: Advances, Systems and Applications 2013, 2:22 http://www.journalofcloudcomputing.com/content/2/1/22 RESEARCH Open Access Data management in cloud environments: NoSQL and NewSQL data stores Katarina Grolinger1, Wilson A Higashino1,2*, Abhinav Tiwari1 and Miriam AM Capretz1 Abstract Advances in Web technology and the proliferation of mobile devices and sensors connected to the Internet have resulted in immense processing and storage requirements. Cloud computing has emerged as a paradigm that promises to meet these requirements. This work focuses on the storage aspect of cloud computing, specifically on data management in cloud environments. Traditional relational databases were designed in a different hardware and software era and are facing challenges in meeting the performance and scale requirements of Big Data. NoSQL and NewSQL data stores present themselves as alternatives that can handle huge volume of data. Because of the large number and diversity of existing NoSQL and NewSQL solutions, it is difficult to comprehend the domain and even more challenging to choose an appropriate solution for a specific task. Therefore, this paper reviews NoSQL and NewSQL solutions with the objective of: (1) providing a perspective in the field, (2) providing guidance to practitioners and researchers to choose the appropriate data store, and (3) identifying challenges and opportunities in the field. Specifically, the most prominent solutions are compared focusing on data models, querying, scaling, and security related capabilities. Features driving the ability to scale read requests and write requests, or scaling data storage are investigated, in particular partitioning, replication, consistency, and concurrency control. Furthermore, use cases and scenarios in which NoSQL and NewSQL data stores have been used are discussed and the suitability of various solutions for different sets of applications is examined. Consequently, this study has identified challenges in the field, including the immense diversity and inconsistency of terminologies, limited documentation, sparse comparison and benchmarking criteria, and nonexistence of standardized query languages. Keywords: NoSQL; NewSQL; Big data; Cloud computing; Distributed storage; Data management Introduction many challenges in meeting the performance and scaling In recent years, advances in Web technology and the requirements of this “Big Data” reality. proliferation of sensors and mobile devices connected to Big Data is a term used to refer to massive and complex the Internet have resulted in the generation of immense datasets made up of a variety of data structures, including data sets that need to be processed and stored. Just on structured, semi-structured, and unstructured data. Ac- Facebook, 2.4 billion content items are shared among cording to the Gartner group, Big Data can be defined by friends every day [1]. Today, businesses generate massive 3Vs: volume, velocity, and variety [4]. Today, businesses volume of data which has grown too big to be managed are aware that this huge volume of data can be used to and analyzed by traditional data processing tools [2]. generate new opportunities and process improvements Indeed, traditional relational database management sys- through their processing and analysis [5,6]. tems (RDBMS) were designed in an era when the avail- At about the same time, cloud computing has also able hardware, as well as the storage and processing emerged as a computational paradigm for on-demand requirements, were very different than they are today network access to a shared pool of computing resources [3]. Therefore, these solutions have been encountering (e.g., network, servers, storage, applications, and services) that can be rapidly provisioned with minimal management * Correspondence: [email protected] effort [7]. Cloud computing is associated with service pro- 1Department of Electrical and Computer Engineering, Faculty of Engineering, Western University, London, ON N6A 5B9, Canada visioning, in which service providers offer computer-based 2Instituto de Computação, Universidade Estadual de Campinas, Campinas, SP, services to consumers over the network. Often these Brazil © 2013 Grolinger et al.; licensee Springer. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Grolinger et al. Journal of Cloud Computing: Advances, Systems and Applications 2013, 2:22 Page 2 of 24 http://www.journalofcloudcomputing.com/content/2/1/22 services are based on a pay-per-use model where the NewSQL data stores with the intent of filling this gap. consumer pays only for the resources used. Overall, a More specifically, this survey has the following objectives: cloud computing model aims to provide benefits in terms of lesser up-front investment, lower operating To provide a perspective on the domain by costs, higher scalability, elasticity, easy access through summarizing, organizing, and categorizing NoSQL the Web, and reduced business risks and maintenance and NewSQL solutions. expenses [8]. To compare the characteristics of the leading Due to such characteristics of cloud computing, solutions in order to provide guidance to many applications have been created in or migrated to practitioners and researchers to choose the cloud environments over the last few years [9]. In fact, appropriate data store for specific applications. it is interesting to notice the extent of synergy between To identify research challenges and opportunities the processing requirements of Big Data applications, in the field of large-scale distributed data and the availability and scalability of computational management. resources offered by cloud services. Nevertheless, the effective leveraging of cloud infrastructure requires care- NoSQL data models and categorization of NoSQL data ful design and implementation of applications and data stores have been addressed in other surveys [10-14]. management systems. Cloud environments impose new In addition, aspects associated with NoSQL, such as requirements to data management; specifically, a cloud MapReduce, the CAP theorem, and eventual consistency data management system needs to have: have also been discussed in the literature [15,16]. This paper presents a short overview of NoSQL concepts Scalability and high performance, because today’s and data models; nevertheless, the main contributions applications are experiencing continuous growth in of this paper include: terms of the data they need to store, the users they must serve, and the throughput they should A discussion of NewSQL data stores. The category provide; of NewSQL solutions is recent; the first use of the Elasticity, as cloud applications can be subjected to term was in 2011 [17]. NewSQL solutions aim to enormous fluctuations in their access patterns; bring the relational data model into the world of Ability to run on commodity heterogeneous NoSQL. Therefore, a comparison among NewSQL servers, as most cloud environments are based on and NoSQL solutions is essential to understand them; this new class of data stores. Fault tolerance, given that commodity machines are A detailed comparison among various NoSQL and much more prone to fail than high-end servers; NewSQL solutions over a large number of Security and privacy features, because the data may dimensions. By presenting this comparison in a table now be stored on third-party premises on resources form, this paper helps practitioners to choose the shared among different tenants; appropriate data store for the task at hand. Previous Availability, as critical applications have also been surveys have included comparisons of NoSQL moving to the cloud and cannot afford extended solutions [11]; nonetheless, the number of compared periods of downtime. attributes was limited, and the analysis performed was not as comprehensive. Faced with the challenges that traditional RDBMSs A review of a number of security features is also encounter in handling Big Data and in satisfying the included in the data store comparison. According cloud requirements described above, a number of spe- to the surveyed literature [10-14], security has cialized solutions have emerged in the last few years in been overlooked, even though it is an important an attempt to address these concerns. The so-called aspect of the adoption of NoSQL solutions in NoSQL and NewSQL data stores present themselves as practice. data processing alternatives that can handle this huge A discussion of the suitability of various NoSQL volume of data and provide the required scalability. and NewSQL solutions for different sets of Despite the appropriateness of NoSQL and NewSQL applications. NoSQL and NewSQL solutions differ data stores as cloud data management systems, the im- greatly in their characteristics; moreover, changes mense number of existing solutions (over 120 [10]) in this area are rapid, with frequent releases of new and the discrepancies among them make it difficult to features and options. Therefore, this work formulate a perspective on the domain and even more discusses the suitability of NoSQL and NewSQL challenging to select the appropriate solution for a data stores for different use cases from the problem at hand. This survey reviews NoSQL and perspective of core design decisions. Grolinger et al. Journal of