
Journal of Theoretical and Applied Information Technology, 15th July 2016. Vol. 89. No. 1. © 2005-2016 JATIT & LLS. All rights reserved.

ISSN: 1992-8645 www.jatit.org E-ISSN: 1817-3195

GRAPH BASED WORKLOAD DRIVEN PARTITIONING SYSTEM FOR NOSQL DATABASE

1SHIVANJALI KANASE, 2SWATI AHIRRAO
1,2Symbiosis International University, Department of Computer Science, Pune
E-mail: [email protected], [email protected]

ABSTRACT

Cloud computing is a rising technology that deploys scalable web applications and provides storage solutions to individuals and enterprises. Massively scalable NoSQL data stores achieve the highly desirable scalability and availability by using various partitioning methods to handle the exponentially growing demands of their users. The data stores are partitioned horizontally and distributed across geographically dispersed data centers to balance the load. However, if the partitions are not formed properly, this may result in expensive distributed transactions. A number of partitioning techniques exist, such as hash and range partitioning, but they do not apply effectively to workloads consisting of fairly small transactions that touch few records, and they do not consider the n-to-n relationships between the records. In this paper, a graph-based workload-driven partitioning system for NoSQL databases using OLTP workloads is proposed to improve scalability while minimizing distributed transactions. The system works in two phases: 1) it generates a graph from the transaction workload, in which database tuples are represented as nodes and tuples accessed in the same transaction are connected by edges, and then applies a graph partitioning technique to find min-cut balanced partitions; 2) it uses a decision tree classifier, which gives a range predicate partitioning identical to the partitions formed by the graph partitioning algorithm. To assess the proposed system, the TPC-C schema is implemented on the Apache CouchDB NoSQL data store. The experimental results indicate that the graph-based scheme significantly improves the response time and latency.

Keywords: Graph Partitioning Methods, OLTP Workload, Scalability, Distributed Transactions, NoSQL Database, Decision Tree Classifier, Lookup Table

1. INTRODUCTION

The emergence of Big Data, Internet of Things and hybrid cloud applications enables businesses to perform faster, smarter and more efficiently than ever. Companies need a robust foundation for data handling and for running mission-critical applications in order to retain a competitive edge. This tremendous growth of unstructured data and new applications has caused an evolution among database management companies. Nowadays these companies are shifting from traditional relational databases to NoSQL databases, as the former lack the flexibility and the ability to scale that are needed to handle globally massive amounts of unstructured data; NoSQL has become almost synonymous with scalability.

NoSQL databases can be scaled out efficiently by utilizing various partitioning methods. Partitioning provides availability, easy maintenance and improved query performance to database users. With the objective of improving the scalability of applications, many cloud providers partition data and distribute these partitions across geographically dispersed data centers to balance the load. However, if the partitions are not correctly formed, this may result in expensive distributed transactions. Existing partitioning schemes such as round-robin, range and hash partitioning apply poorly to workloads of small transactions that touch only a couple of records [1]. Moreover, these approaches ignore the relations among database tuples and end up placing clusters of unrelated tuples on the same partition, which eventually results in costly distributed transactions. In OLTP workloads, to preserve transactional properties, distributed transactions must use a distributed consensus protocol, which might result in immoderate and undesirable



latencies, decreased throughput and network messages [2]. Hence, it is necessary to find a partitioning technique that considers the relativity among the tuples while partitioning and minimizes the adverse effects of multisite transactions.

The proposed system frames the NoSQL database workload in the form of a graph, wherein the nodes represent the database tuples and each edge connects a pair of tuples accessed inside the same transaction. A graph partitioning technique is applied to find min-cut k balanced partitions, which minimizes the multisite transactions. Unless the workload characteristics change drastically, the tuples of a single partition remain associated with one another, and the occurrences and unfavorable consequences of distributed transactions are reduced rapidly. Lastly, decision tree classifier techniques are applied to extract a set of rules that describe the predicates explaining the graph partitioning strategy; this step improves the runtime of query processing.

The strengths of the proposed system are: i) it does not depend on any schema layout; ii) it can be applied to social networking data stores, as it considers the n-to-n relationships between data tuples; iii) it is a fine-grained approach for NoSQL database partitioning. The system is basically designed for OLTP workloads using the TPC-C schema. In addition to this new partitioning approach, the proposed system offers some further contributions:

• A design is presented based on a decision tree classifier which gives a range predicate partitioning similar to the graph partitioning algorithm; using this, a lookup table is built to improve query performance.
• A mapping of the TPC-C schema to the Apache CouchDB NoSQL data store is presented.
• Apache CouchDB is integrated with the Neo4j NoSQL graph database for representing the OLTP workload in the form of a graph.
• The METIS graph partitioning algorithm is implemented on the graph stored in the Neo4j graph database.
• The system takes a reasonable time, just a few minutes, to partition a million-tuple dataset. Heuristics such as transaction-level sampling, tuple-level sampling and relevance filtering are proposed to diminish the size of the graph and to further improve the partitioning time for steadily growing graphs.
• A practical implementation of the graph-based workload-driven partitioning system on Apache CouchDB is presented, and its performance is evaluated using the TPC-C schema.

The rest of this paper is structured as follows. Section 2 discusses work related to scalability and graph-based database partitioning. Section 3 gives a brief overview of the proposed system. The design of the graph-based workload-driven partitioning system is presented in Section 4. Section 5 explains the implementation and the performance evaluation of the partitioning system. In Sections 6 and 7, the experimental setup and the results are described, respectively. Finally, Section 8 concludes the paper.

2. RELATED WORK

For more than a few years, scalable and distributed database management has been an area of interest of the database research community. Accordingly, plenty of systems and techniques for scalable database transactions have been documented in the literature. This section discusses related work on different partitioning algorithms and graph partitioning techniques.

2.1 Workload-driven Partitioning Algorithms
A data-driven partitioning approach called Schism was first introduced by Curino et al. [1] for OLTP databases. Schism models the workload as a graph and enforces a k-way min-cut graph partitioning algorithm to lessen the effect of distributed transactions and increase the throughput. However, the graph generated by Schism is extremely massive, and the approach usually does not deal with progressive workload deviation and repartitioning.

Sudipto Das et al. present ElasTras [3], which is scalable and elastic to handle varying amounts of load, with the additional functionality of fault-tolerance and self-management. ElasTras achieves scalability by partitioning the large database using schema-level partitioning, which restricts transactional updates to a specific database partition. The associated rows are grouped together in a single partition using schema patterns. Despite the various functionalities the system possesses, it lacks a



replication mechanism to support high availability.

J. Baker et al. developed Megastore [4], a scalable repository for interactive online systems. Megastore integrates the scalability of NoSQL with the guarantees of an RDBMS and offers strong consistency and high availability. Megastore gives full ACID properties inside partitions. The system partitions the data into a collection of entity groups, i.e., sets of correlated data items, which are independently replicated over a set of servers; each update is replicated synchronously across the identical partitions with acceptable latency. A transaction that requires a strongly consistent view of the database to fulfill its execution is restricted to a single entity group.

Ahirrao S. [5] invented a scalable workload-driven data partitioning scheme. Here, the access patterns are examined by monitoring the web access logs. Based on these access patterns, partitions are formed which can be changed adaptively to balance the load among all partitions.

2.2 Graph Partitioning Methods
The graph partitioning problem has been an area of interest of many researchers over the last few decades and has been reviewed and applied in many disciplines. The major objective of graph partitioning is to find balanced, high-quality partitions which reduce the edge cut. Many solutions are available to handle the challenge of discovering balanced graph partitions, which is itself an NP-complete problem.

JA-BE-JA [6] is a distributed graph partitioning algorithm which uses sampling and swapping techniques to find balanced partitions in a graph. Every node is processed individually, and only the immediate neighbors of a vertex and a small group of arbitrary vertices need to be known locally. The algorithm does not require the entire graph at once and works with only partial information; hence, it is efficient for computing very big graphs. Initially, every vertex is put into a random partition; over time, vertices adaptively swap with other vertices to collocate each vertex with its highly related neighbors in one partition and improve scalability. However, the swapping cost increases as the number of partitions increases.

Multilevel algorithms for partitioning graphs were first described by Karypis and Kumar [7]. Typically, such multilevel schemes combine vertices and edges using matching techniques to decrease the size of the graph. This procedure iterates recursively until a small enough graph is formed. The reduced graph is then initially partitioned into k partitions, which are iteratively refined to find the k partitions of the initial graph. Graph coloring techniques are used to parallelize the procedures of the multilevel partitioning algorithm, providing a high degree of concurrency.

A large community of researchers has examined partitioning designs on traditional databases, yet there is still a need to build an effective partitioning framework for NoSQL databases that enables them to scale greatly. Previously, graph partitioning techniques were used in combination with a decision tree classifier to improve the performance of SQL databases. This work concentrates on the same integration of graph partitioning and a machine learning technique, the decision tree classifier, with some improvisation, specifically designed for a NoSQL data store using the TPC-C workload.

3. PROPOSED SYSTEM OVERVIEW

Definition 1. Workload-Driven Partitioning System: a workload-driven partitioning system is a system in which the database is partitioned based on the relativity of tuples and their access patterns. Related tuples, i.e., tuples which are accessed together in most of the transactions, are placed together in one partition.

The high-level architecture of the system is shown in Figure 1. The proposed system is designed to work as a scalable transactional layer on top of any distributed storage system, such as CouchDB or Amazon SimpleDB, which implements CRUD (create/insert, read, update, and delete) operations. The users interact with the distributed database as if it were a centralized database system. The system has a two-tier architecture: 1) backend database nodes and 2) a coordinator node. The backend database nodes are the NoSQL database servers (CouchDB servers) which actually hold the data. The coordinator node contains the query routers and the lookup table. A client sends requests to a query router, which forwards them to the partitions with the help of the lookup table and returns



the responses to the clients. The queries from multiple clients are processed in parallel by the query router, which also dispatches sub-queries to the backend in parallel. More coordinator nodes can be added to distribute the transaction load. The coordinator node is connected to a number of geographically dispersed data nodes where the actual database partitions are placed. Each data node is synchronously replicated in a Master-Master fashion to assure high availability; thus, the system does not explicitly handle tuple-level replication. The coordinator node holds the partitioning metadata and the network address of each backend data node. The partitioning metadata records how individual tuples are divided into partitions and how these partitions are mapped to the backend nodes.

Figure 1. System Architecture

Table 1 gives a brief overview of our graph-based workload-driven partitioning system. The input to the system is a workload trace and the number of partitions required; the output is a set of balanced partitions which minimizes the overall cost of running the workload by reducing expensive distributed transactions. The basic process is outlined in the following steps:

Workload pre-processing: The system takes as input the workload trace, that is, the collection of transactions and the tuples they access. Each transaction is processed to extract the set of read and update sub-queries contained in it. For each sub-query, the unique ids of the individual tuples are retrieved.

Graph Representation: The tuples accessed in the input transactions are represented as nodes in the graph. The tuples accessed within a single transaction are connected by edges. Sampling techniques are used to reduce the size of the graph.

Graph Partitioning: The graph partitioning algorithm is applied on the created graph. The output of the algorithm is a high-quality balanced k-partitioning. Each node of the graph is mapped to a unique partition such that the maximum number of its adjacent edges is collocated in a single partition. These k partitions are then assigned to k physical nodes.

Lookup Table Creation: The lookup table stores the (tuple, partition) pairs generated by the graph partitioning. Basically, the lookup table is used for forwarding a query to the correct partition, as it specifies which tuple is stored on which partition.

Decision Tree Classifier: The sub-queries of the transactions are analyzed to accumulate the collection of frequently accessed attributes in the WHERE clauses. A set of rules is extracted using the decision tree classifier, which gives a range predicate partitioning based on the frequent attribute values. The resulting partitioning is identical to the graph partitioning.

The resulting data partitioning strategy can be explicitly added to any NoSQL database that supports partitioning for scaling in a distributed shared-nothing architecture.

Table 1. System Overview

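The workload pre-processing step above can be sketched as follows. This is a minimal illustration, not the system's actual parser: the query shapes and the regular expression are simplified assumptions.

```python
import re

def extract_tuple_ids(transaction_sql):
    """Pull the (attribute, value) pairs referenced in the WHERE
    clauses of a transaction's read/update sub-queries. Only quoted
    equality predicates are recognized in this sketch."""
    ids = set()
    for sub_query in transaction_sql:
        for attr, val in re.findall(r"(\w+)\s*=\s*'(\w+)'", sub_query):
            ids.add((attr, val))
    return ids

# Hypothetical sub-queries of one TPC-C-style transaction:
txn = [
    "SELECT * FROM stock WHERE S_W_ID = 'W1' AND S_I_ID = 'I5'",
    "UPDATE district SET d_next_o_id = 7 WHERE D_ID = 'D11'",
]
ids = extract_tuple_ids(txn)
# {('S_W_ID', 'W1'), ('S_I_ID', 'I5'), ('D_ID', 'D11')}
```

Each extracted pair then becomes a candidate node in the co-access graph of the next step.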



4. DESIGN OF GRAPH BASED WORKLOAD DRIVEN PARTITIONING SYSTEM FOR NOSQL DATABASE

Various NoSQL database management systems are available, based on key-value, column-family and document-oriented models. For designing the proposed approach, the document-oriented NoSQL database Apache CouchDB has been selected, as it is well suited to handle the transactional TPC-C-based e-commerce application and provides a semi-structured schema for storing the data.

4.1 Graph Representation
The transaction workload can be represented in the form of a graph. The graph representation approach is presented here with an example. Although the example contains only one transaction database, the approach works on any number of transaction databases with any schema layout. The example has one transaction database called Order, as shown in Figure 2. When a new record is added to the document-oriented database, it is stored as a new document with a unique id. The Order database records the details of five order transactions executed on different databases in CouchDB. Each transaction is inserted into the Order database as a new document with a unique O_ID and has the following attributes: (O_ID, W_ID, D_ID, C_ID, I_ID, S_ID). The attribute values of each document are represented as nodes in the graph, connected by edges. The C_ID node is considered the base node, and all other attribute values from that customer's transactions are connected to it with edges. The edge cost indicates the total number of transactions that co-access the pair of tuples. (Example: the attribute values (W2, D22, C2, I1, S12) are accessed in transactions O1 and O2, so the cost of the edges between them is 2.) When a new transaction occurs, the system first checks whether a node for the particular attribute value is already present in the graph. If it is absent, a new node for that attribute value is added to the graph, and the respective edge costs are updated.

4.1.1 Graph size reduction
A practical graph partitioning system must handle huge databases. As the number of transactions increases, the number of nodes in the graph grows, and so does the size of the graph representation [1]. Additional partitions are required for the effective handling of a large database, which means more cuts must be found in the graph; this increases the processing overhead and the runtime of the algorithm. Hence, the size of the graph needs to be reduced, while taking care that the reduced graph still contains sufficient information to form high-quality partitions. A number of heuristics for graph size reduction have been introduced which considerably reduce the runtime of the algorithm with little effect on the quality of the partitions. The following sampling techniques for graph size reduction have been implemented in the proposed system:

Transaction-level sampling: It reduces the number of edges by limiting the number of transactions represented in the graph. As almost 45% of the TPC-C workload consists of New Order transactions (explained in Section 5.2.1), only the New Order transaction workload has been represented in the graph.

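The incremental graph construction of Section 4.1 can be sketched as follows, assuming order documents with the attribute layout described above; the class and method names are illustrative, not the system's actual API.

```python
class WorkloadGraph:
    """Incremental co-access graph: C_ID is the base node, the other
    attribute values of the same order transaction are linked to it,
    and the edge cost counts how many transactions co-access the pair."""
    def __init__(self):
        self.nodes = set()
        self.edge_cost = {}

    def add_order(self, doc):
        base = doc["C_ID"]
        self.nodes.add(base)
        # O_ID is skipped: it is unique per transaction (relevance filtering)
        for key in ("W_ID", "D_ID", "I_ID", "S_ID"):
            val = doc[key]
            self.nodes.add(val)
            edge = (base, val)
            self.edge_cost[edge] = self.edge_cost.get(edge, 0) + 1

g = WorkloadGraph()
g.add_order({"O_ID": "O1", "W_ID": "W2", "D_ID": "D22",
             "C_ID": "C2", "I_ID": "I1", "S_ID": "S12"})
g.add_order({"O_ID": "O2", "W_ID": "W2", "D_ID": "D22",
             "C_ID": "C2", "I_ID": "I1", "S_ID": "S12"})
# edge (C2, W2) now has cost 2, matching the example in the text
```

Adding a third transaction for a different customer would simply create a new base node with its own edges, leaving existing costs untouched.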


Tuple-level sampling: It limits the number of tuples represented in the graph.

Relevance filtering: It removes rarely accessed tuples from the graph, as they give little information. For example, O_ID is not represented in the graph, as it is unique for each transaction and carries little partitioning information.

These sampling techniques have been validated to be powerful in decreasing the graph size for the concerned TPC-C workload while keeping high-quality results.

4.2 Graph Partitioning
The system represents the transactional workload in the form of a graph. The graph partitioning algorithm divides the nodes of the graph into k balanced partitions such that the number of edges crossing the partitions is minimized. The graph partitioning problem has many applications in fields like VLSI design, parallel scientific computing and sparse matrix reordering; however, it is NP-complete [9]. In recent decades, many multilevel schemes have been introduced to solve this problem. For the k-way partitioning problem, the recursive bisection method has been used. First, it combines random vertices and edges using a matching technique to form a smaller graph. This smaller graph is bisected initially, and a refinement algorithm is applied to form two balanced partitions of the original graph in such a way that the cut is minimal. These steps are repeated until the k partitions of the original graph are formed. The whole process is composed of three phases, as shown in Figure 3. Different methods for these phases are described in many papers [8, 9, 10]; we have selected some of these methods with minor changes. The phases are described in detail as follows:

Coarsening phase: The fundamental process in this phase is to combine nodes and edges to form a smaller graph. First, the adjacency matrix is formed from the graph representation. For each base node in the graph, all edges adjacent to it are sorted in decreasing order of their cost and added to a queue. At each level, the first node in the queue is retrieved and merged with the base node. Once a node is merged into a base node, it is marked as matched and cannot be merged into another base node. This merging continues until a small enough graph is formed. The underlying merging technique is called Heavy Edge Maximal Matching. This coarsening technique has the property that a balanced partitioning of the small graph corresponds to a balanced partitioning of the original graph.

Initial partitioning phase: In the second phase, the smaller graph is partitioned into two parts such that each partition has equal node weight; an equal number of nodes is assigned to each partition to balance them. On this initial partition, the refinement algorithm is applied to find the final fine-grained partitions.

Uncoarsening phase: Many refinement algorithms are variants of the popular Kernighan-Lin (KL) partitioning algorithm [11]. The initial partitioning is assumed to be the best possible partition of the coarser graph. The KL algorithm iteratively swaps subsets of nodes from one partition with the same number of nodes from another partition in such a way that the edges crossing the partition boundary have minimum cost. The algorithm yields good local optimum solutions.

These phases are applied recursively until the k balanced partitions of the original graph are generated. The output of the whole partitioning system is a fine-grained mapping between nodes and partition labels. As highly correlated nodes are placed in one partition, this helps minimize distributed transactions and increases system performance.

Figure 3. Multilevel Recursive Bisection

4.3 Lookup Table Creation
A lookup table, as shown in Table 2, is used for storing the location of each node. This lookup table is present at the coordinator node and helps the router forward each incoming query to the appropriate physical node [12].

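As a sketch of how the coordinator could use such a lookup table, the following hypothetical QueryRouter maps a routing key taken from the WHERE clause to a backend address. The class, the addresses and the fallback behavior are illustrative assumptions, not the system's actual API.

```python
class QueryRouter:
    """Coordinator-side router sketch: the lookup table maps a tuple's
    routing key to its partition label; the router forwards the query
    to the backend node holding that partition."""
    def __init__(self, lookup_table, backends):
        self.lookup = lookup_table    # routing key -> partition label
        self.backends = backends      # partition label -> node address

    def route(self, where_key):
        part = self.lookup.get(where_key)
        if part is None:
            # Unknown key: e.g. a fresh insert first lands on a random
            # partition and migrates later, as described in the text.
            return None
        return self.backends[part]

router = QueryRouter({"W1": 1, "W2": 2},
                     {1: "node-a:5984", 2: "node-b:5984"})
# a query with WHERE W_ID = 'W2' is forwarded to node-b:5984
```

In the full system, each routed sub-query would be dispatched in parallel and the partial results merged at the coordinator.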


The router parses the incoming query, retrieves the predicates over the attribute values from the WHERE clause, and compares them with the lookup table entries to find the location of the attributes and execute the query on the correct backend. This lookup table structure effectively increases the throughput of the system. The lookup table needs to be stored in the RAM of the router [1]. It stores the mapping of unique keys to partition labels. Each row of the lookup table requires 10 bytes of memory, of which 8 bytes are for the unique key and 2 bytes for the partition label. With this memory requirement, the RAM of a single machine is sufficient for OLTP applications [12]; if the lookup table does not fit into the RAM of a single machine, it can be distributed over several machines. When a new tuple is inserted into the database, it is first placed on a random partition and eventually migrates to its correct partition. This technique is efficient when a fine-grained partitioning is wanted, but it is not suitable for insertion-heavy databases, as it increases the network overhead.

One way to reduce the memory requirements of the lookup table is to replace its fine-grained mapping with a memory-efficient range predicate partitioning. For this purpose, an analysis tool is used which gives a range predicate partitioning identical to the graph partitioning.

Table 2. Lookup Table generated from the graph in Figure 2

4.4 Explanation Phase Using the Decision Tree Classifier
The output of the partitioning phase is the fine-grained mapping between attribute values and partition labels. A compact model for the generation of the lookup table is required, which stores the minimum number of mappings such that the location of a target attribute can still be predicted [13]. For that purpose, a decision tree classifier is used, which generates a range predicate partitioning based on the rules produced by the decision tree. The classifier takes the set of (attribute value, partition label) pairs formed by the graph partitioning as input and generates a decision tree of predicates over the attribute values, going down to leaf nodes carrying particular partition labels. The location of an unlabeled attribute value can be found by traversing the tree downwards and applying the predicates at each level until a leaf node with a partition label is reached.

The set of rules generated from the decision tree captures the core of the graph partitioning in a compact form. For example, from Figure 2 the rules identified by the decision tree are:

(W_ID = W1) → Partition = 1
(W_ID = W2) → Partition = 2

As not every generated rule is useful, only the rules based on the frequently accessed attributes of the WHERE clauses are considered. (For instance, in this application more than half of the queries use the W_ID attribute in their WHERE clause, so it is used to route transactions to the appropriate partition and minimize distributed transactions.) W_ID acts as the reference key for the Customer, District, Order, New Order, Order Line, Stock and History tables of TPC-C (explained in Section 5.2.1); hence, these tables can be classified according to the value of W_ID using the decision tree classifier.
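To illustrate the explanation phase, here is a minimal sketch that compresses a fine-grained (attribute value, partition label) mapping into per-value rules by majority vote. It only approximates what a decision tree such as Weka's J48 produces; the training data and function names are illustrative.

```python
from collections import Counter, defaultdict

def extract_rules(labeled_rows, attribute):
    """Compress the fine-grained (tuple, partition) mapping into one
    equality predicate per value of a chosen candidate attribute,
    using a simple majority vote per attribute value."""
    votes = defaultdict(Counter)
    for row, partition in labeled_rows:
        votes[row[attribute]][partition] += 1
    # one rule per attribute value: value -> most frequent partition
    return {val: cnt.most_common(1)[0][0] for val, cnt in votes.items()}

training = [
    ({"W_ID": "W1", "D_ID": "D11"}, 1),
    ({"W_ID": "W1", "D_ID": "D12"}, 1),
    ({"W_ID": "W2", "D_ID": "D21"}, 2),
]
rules = extract_rules(training, "W_ID")
# rules == {"W1": 1, "W2": 2}, i.e. (W_ID = W1) -> Partition 1
```

A real decision tree would additionally choose the splitting attribute itself and prune low-support rules to avoid over-fitting, as described above.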
5. IMPLEMENTATION DETAILS

In this section, the system implementation details are presented, followed by the experimental evaluation of the graph partitioning algorithm in Apache CouchDB. Implementing a graph-based workload-driven partitioning system on a NoSQL database is a real challenge and requires some modifications; some of the design decisions made during the system implementation are also discussed.



5.1 System Implementation

5.1.1 Integration of Apache CouchDB with the Neo4j NoSQL Graph Database

Neo4j is an open-source graph database developed entirely in Java [17]. This schema-free NoSQL graph database is used to represent, as a graph, the transactional workload stored in the CouchDB database. The two NoSQL databases are integrated to enable automatic synchronization of data between them: when a New Order transaction is executed on CouchDB, a corresponding node is automatically created in the Neo4j graph database. The graph partitioner exploits the relationships maintained by Neo4j to collocate related tuples.

5.1.2 Implementation of Range Predicate Partitioning Based on a Decision Tree Classifier

The popular machine learning suite Weka 3 is used to obtain a range-predicate explanation of the graph partitioning.

Training set creation: The system extracts the sub-queries from each transaction, together with the attribute values accessed in the transaction workload. Each accessed attribute value is labeled with the partition name produced by the graph partitioning. This training set is given as input to the decision tree classifier to identify the candidate attributes.

Selection of attributes: The system parses the sub-queries and counts the frequency of each attribute in their WHERE clauses. Attributes with low frequency are discarded, as they carry little information for routing a query to the right partition. For example, for the TPC-C Stock table the two frequently accessed attributes obtained are (S_W_ID, S_I_ID). These candidate attributes are fed into Weka's correlation-based feature selection to compute the set of attributes that are correlated with the partition label of each candidate attribute. The candidate attribute with the highest number of correlated attributes sharing the same partition label is selected as the classification attribute, and the rest are discarded. The selected attribute is used to build the decision tree, in which it acts as the root node and classifies the target attributes to obtain their locations.

Building the decision tree classifier: The decision tree classifier is constructed with J48, the Java implementation of the C4.5 algorithm [18]. The classification attributes are the partition labels, which have to be learned from the candidate attributes. The rules generated from this decision tree give a range-predicate partitioning identical to the graph-based partitioning. Rules with low support are discarded to avoid the risk of over-fitting, as they are of little use. The classifier rules indicate that all tuples of the TPC-C Item table have to be replicated on all partitions, since that table has no references to other TPC-C tables and can be placed independently on each partition. The overall result of the range-predicate partitioning is to classify the database by W_ID, as it is the most frequently accessed attribute across all transactions and is highly correlated with the other attributes, while the Item table is replicated on every partition.

5.2 Performance Evaluation

The performance of the system is evaluated using the TPC-C schema.

5.2.1 TPC-C Benchmark

The TPC-C benchmark is a standard yardstick for evaluating OLTP applications [15]. It simulates an OLTP workload for an e-commerce web application. There are five transactions: New Order, Payment, Order Status, Stock Level, and Delivery. Every transaction contains read and update sub-queries. These five transactions are executed on the nine TPC-C tables: Warehouse, District, Customer, Order, New Order, Order Line, History, Stock, and Item, as shown in Figure 4. In a real-life scenario, typically 45% of transactions are New Order, 43% are Payment, and 4% each are Delivery, Order Status, and Stock Level [5].

5.2.2 Mapping of TPC-C to Apache CouchDB

The TPC-C schema is primarily designed for relational databases, with nine tables. These nine tables are mapped to nine document types in CouchDB. CouchDB is a document-oriented NoSQL database that completely embraces the Web [16]. It stores data hierarchically in a semi-structured format. Each document, analogous to a row in a relational database, is uniquely named with an ID. The attribute W_ID is present in every table except the Item table.
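The graph construction and partitioning steps described above can be sketched in code. The following is an illustrative sketch, not the authors' implementation: tuples become nodes, tuples touched by the same transaction are connected by weighted edges, and a simple greedy capacity-bounded heuristic stands in for a real min-cut partitioner such as METIS. All class and tuple names are hypothetical.

```java
import java.util.*;

// Sketch: co-access graph from a transaction workload, then a greedy
// balanced partitioning (a stand-in for a proper min-cut algorithm).
class CoAccessGraph {
    // Edge weights keyed by "a|b" with a < b lexicographically.
    private final Map<String, Integer> edges = new HashMap<>();
    private final Set<String> nodes = new LinkedHashSet<>();

    // Every pair of tuples accessed in the same transaction gets an edge;
    // repeated co-access increments the edge weight.
    public void addTransaction(List<String> tupleIds) {
        nodes.addAll(tupleIds);
        for (int i = 0; i < tupleIds.size(); i++)
            for (int j = i + 1; j < tupleIds.size(); j++) {
                String a = tupleIds.get(i), b = tupleIds.get(j);
                String key = a.compareTo(b) < 0 ? a + "|" + b : b + "|" + a;
                edges.merge(key, 1, Integer::sum);
            }
    }

    // Greedy balanced assignment: place each node on the partition holding
    // its most strongly connected already-assigned neighbours, subject to a
    // per-partition capacity cap so partitions stay balanced.
    public Map<String, Integer> partition(int k) {
        int cap = (nodes.size() + k - 1) / k;
        int[] sizes = new int[k];
        Map<String, Integer> assign = new LinkedHashMap<>();
        for (String n : nodes) {
            double[] score = new double[k];
            for (Map.Entry<String, Integer> e : edges.entrySet()) {
                String[] ab = e.getKey().split("\\|");
                String other = ab[0].equals(n) ? ab[1]
                             : ab[1].equals(n) ? ab[0] : null;
                if (other != null && assign.containsKey(other))
                    score[assign.get(other)] += e.getValue();
            }
            int best = -1;
            for (int p = 0; p < k; p++)
                if (sizes[p] < cap && (best == -1 || score[p] > score[best]))
                    best = p;
            assign.put(n, best);
            sizes[best]++;
        }
        return assign;
    }
}
```

With a workload in which, say, warehouse `w1` is always accessed together with district `d1`, the heuristic places the two on the same partition, which is the collocation effect the paper relies on to avoid distributed transactions.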



Figure 4: Components of TPC-C

Figure 5: Response Time Comparison for New Order Transaction
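The range-predicate routing that the classifier rules in the preceding section describe amounts to a compact lookup structure. The following is a hypothetical sketch under assumed W_ID ranges and partition counts, not the actual rules learned by the system:

```java
import java.util.*;

// Sketch: route a tuple to partitions using range predicates on W_ID.
// Item-table tuples, which reference no W_ID, are replicated everywhere.
class RangePredicateRouter {
    // Lower bound of each W_ID range -> partition holding that range.
    private final TreeMap<Integer, Integer> ranges = new TreeMap<>();
    private final int partitionCount;

    public RangePredicateRouter(int partitionCount) {
        this.partitionCount = partitionCount;
    }

    // A rule such as "W_ID >= 6 -> partition 1" (hypothetical values).
    public void addRule(int wIdFrom, int partition) {
        ranges.put(wIdFrom, partition);
    }

    // Returns every partition that must hold the tuple.
    public List<Integer> route(String table, int wId) {
        if (table.equals("ITEM")) {            // replicated on all partitions
            List<Integer> all = new ArrayList<>();
            for (int p = 0; p < partitionCount; p++) all.add(p);
            return all;
        }
        // Largest range lower bound not exceeding wId picks the partition.
        Map.Entry<Integer, Integer> rule = ranges.floorEntry(wId);
        return Collections.singletonList(rule.getValue());
    }
}
```

A `TreeMap` keyed by range lower bounds keeps the lookup table small, which is the point of replacing a per-tuple lookup table with range predicates.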

6. EXPERIMENTAL SETUP

The proposed system is implemented in Java using the NetBeans IDE, and a series of tests is carried out on it. Three machines, each with a Core i3 processor and 2 GB of primary memory, form the distributed setup; all machines are interconnected through a D-Link 8-port Ethernet switch. The proposed system uses the CouchDB NoSQL database for graph-based workload-driven partitioning, and the Neo4j graph database to represent the transactional workload as a graph. The developed system is tested under several scenarios, described below, to validate its accuracy.

Figure 6: Response Time Comparison for Payment Transaction

7. RESULTS

In this section, the performance of the graph-based workload-driven partitioning is evaluated extensively and compared with the plain TPC-C schema. The goal of this experiment is to validate the scalability of the system with a varying number of concurrent users. Scalability is measured in terms of response time. The figures below show the response time of the graph-based workload-driven partitioning system and of the TPC-C schema: the number of concurrent users is plotted along the x-axis, and the response time (in ms) along the y-axis. As the figures show, our system has a lower response time than the TPC-C schema.

Figure 7: Response Time Comparison for Delivery Transaction
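A measurement loop of the kind behind these response-time plots can be sketched as follows. This is an illustrative harness, not the authors' test code: N concurrent "users" each fire one transaction (stubbed here as a `Runnable`; the real body would execute, e.g., a New Order transaction against CouchDB) and the mean latency in milliseconds is reported.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch: mean per-transaction response time under N concurrent users.
class ResponseTimeBench {
    public static double measure(int users, Runnable txn) {
        ExecutorService pool = Executors.newFixedThreadPool(users);
        try {
            List<Future<Long>> times = new ArrayList<>();
            for (int u = 0; u < users; u++)
                times.add(pool.submit(() -> {
                    long start = System.nanoTime();
                    txn.run();                  // placeholder transaction body
                    return System.nanoTime() - start;
                }));
            long total = 0;
            for (Future<Long> t : times) total += t.get();
            return total / 1e6 / users;         // mean latency in ms
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Sweeping `users` over increasing values and plotting the returned mean against the user count reproduces the shape of the comparisons shown in the figures.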

7.1 Response Time Comparison between TPC-C Schema and Graph Based Workload Driven Partitioning

The response time comparison for all TPC-C transactions is depicted in the figures below:

Figure 8: Response Time Comparison for Stock Level Transaction



Figure 9: Response Time Comparison for Order Status Transaction

All the above graphs indicate that transaction time diminishes when the database is partitioned and kept on different machines in a distributed setup. This clearly shows that the proposed method is incorporated efficiently and that scalability is achieved.

8. CONCLUSION

In this paper, a graph-based workload-driven partitioning system for a NoSQL database is presented. The proposed system represents the workload in the form of a graph in order to exploit the relationships between tuples. Using these relationship characteristics, highly connected tuples are combined in a single partition, which reduces distributed transactions. Heuristics such as sampling techniques are introduced to optimize the graph representation and to improve the performance of the system. A compact lookup-table model based on range-predicate partitioning is also generated using a decision tree classifier. Evaluation on the TPC-C schema shows that the proposed system achieves a response time considerably lower than that of the plain TPC-C schema. The resulting data partitioning strategy can be added explicitly to any distributed shared-nothing NoSQL database to enhance its scalability.

REFERENCES

[1] Curino C, Jones E, Zhang Y, Madden S. Schism: a workload-driven approach to database replication and partitioning. Proceedings of the VLDB Endowment. 2010 Sep 1;3(1-2):48-57.
[2] Quamar A, Kumar KA, Deshpande A. SWORD: scalable workload-aware data placement for transactional workloads. In Proceedings of the 16th International Conference on Extending Database Technology, 2013 Mar 18 (pp. 430-441). ACM.
[3] Das S, Agrawal D, El Abbadi A. ElasTraS: An elastic, scalable, and self-managing transactional database for the cloud. ACM Transactions on Database Systems (TODS). 2013 Apr 1;38(1):5.
[4] Baker J, Bond C, Corbett JC, Furman JJ, Khorlin A, Larson J, Leon JM, Li Y, Lloyd A, Yushprakh V. Megastore: Providing Scalable, Highly Available Storage for Interactive Services. In CIDR 2011 Jan 9 (Vol. 11, pp. 223-234).
[5] Ahirrao S, Ingle R. Scalable transactions in cloud data stores. Journal of Cloud Computing. 2015 Dec 1;4(1):1-4.
[6] Rahimian F, Payberah AH, Girdzijauskas S, Jelasity M, Haridi S. Ja-be-ja: A distributed algorithm for balanced graph partitioning.
[7] Karypis G, Kumar V. Parallel multilevel series k-way partitioning scheme for irregular graphs. SIAM Review. 1999;41(2):278-300.
[8] Karypis G, Kumar V. Multilevel k-way partitioning scheme for irregular graphs. Journal of Parallel and Distributed Computing. 1998 Jan 10;48(1):96-129.
[9] Hendrickson B, Leland R. A multi-level algorithm for partitioning graphs.
[10] Portugal D, Rocha RP. Partitioning Generic Graphs into k Balanced Subgraphs. In Proceedings of the 6th Iberian Congress on Numerical Methods in Engineering (CMNE 2011), Coimbra, Portugal, 2011 Jun 14 (pp. 13-16).
[11] Kernighan BW, Lin S. An efficient heuristic procedure for partitioning graphs. Bell System Technical Journal. 1970 Feb 1;49(2):291-307.
[12] Tatarowicz AL, Curino C, Jones EP, Madden S. Lookup tables: Fine-grained partitioning for distributed databases. In Data Engineering (ICDE), 2012 IEEE 28th International Conference on, 2012 Apr 1 (pp. 102-113). IEEE.
[13] Ranka S, Singh V. CLOUDS: A decision tree classifier for large datasets. In Proceedings of the 4th Knowledge Discovery and Data Mining Conference, 1998 (pp. 2-8).



[14] Kamal J, Murshed M, Buyya R. Workload-aware incremental repartitioning of shared-nothing distributed databases for scalable OLTP applications. Future Generation Computer Systems. 2016 Mar 31;56:421-35.
[15] Transaction Processing Performance Council. TPC Benchmark C (standard specification, revision 5.11), 2010. URL: http://www.tpc.org/tpcc.
[16] Anderson JC, Lehnardt J, Slater N. CouchDB: The Definitive Guide. O'Reilly Media, Inc.; 2010 Jan 19.
[17] Miller JJ. Graph database applications and concepts with Neo4j. In Proceedings of the Southern Association for Information Systems Conference, Atlanta, GA, USA, 2013 Mar 23 (Vol. 2324).
[18] Jain D. A Comparison of Data Mining Tools using the implementation of C4.5 Algorithm. International Journal of Science and Research. 2014 Aug;3(8).
