Transactional Processing for Polyglot Persistence

Ricardo Jimenez-Peris ∗, Marta Patiño-Martinez †, Iván Brondino † and Valerio Vianello †
∗ LeanXcale, Madrid. Email: [email protected]
† Universidad Politécnica de Madrid, Spain. Email: {mpatino,ibrondino,vvianello}@fi.upm.es

Abstract—NoSQL data stores have emerged in recent years as a solution that provides scalability and flexibility in data modelling for operational databases. These data stores have proven better suited than relational databases for some kinds of problems. In order to scale, they relaxed the properties provided by relational databases, mainly transactions. However, transactional semantics is still needed by most applications. In this paper we describe how CoherentPaaS provides scalable holistic transactions across SQL and NoSQL data stores such as document-oriented data stores, key-value data stores and graph databases.

Index Terms—NoSQL, databases, transaction processing.

I. INTRODUCTION

Companies have traditionally used transactional SQL databases to keep their operational data. In the last few years, with the advent of NoSQL, many companies are embracing new NoSQL data stores. These include document-oriented data stores, such as MongoDB [1], key-value data stores, such as Apache HBase [2], and graph databases, such as Neo4J [3] and Sparksee [4].

Companies use NoSQL technology for different reasons. In some cases, the reason is the flexibility of the data model, which makes it possible to store semi-structured data whose high degree of variability makes it too hard to store in relational databases oriented towards fully structured data. This is, for instance, the case for key-value data stores, which are schemaless and can store tuples with arbitrary schemas. In other cases, the reason is the need for a different query language or API to access the data in a more comfortable manner. For instance, document-oriented data stores such as MongoDB offer an API for querying semi-structured documents in a flexible and comfortable way. Sometimes the reason is efficiency, as happens with graph databases. Graph databases can traverse a graph with a single invocation to the database, which avoids the myriad of exchanges between client and server that a relational database needs to perform a graph traversal. Another reason why NoSQL solutions have been adopted is that some of them satisfy the scalability requirements of applications that traditional transactional SQL databases fail to fulfill.

This trend has resulted in a situation where many companies now store their operational data in a diverse set of technologies, both SQL and NoSQL, leading to the so-called Polyglot Persistence [5]. Companies with Polyglot Persistence environments are now facing new challenges. The first one is that NoSQL data stores are non-transactional or at most allow transactions that update a single row. This creates many difficulties in the event of failures and/or concurrent updates. It has been reported that a large amount of money was stolen from a Bitcoin service because it was using NoSQL technology.¹ A hacker ran massive money transfers in parallel over a set of accounts. Since there was no transactional isolation, the hacker was able to transfer the original amount many times before the first transfer updated the value of the account to zero.

The lack of atomicity also causes many problems. When multiple rows should be updated as part of a business operation, a failure in the middle results in an inconsistent database, since only a fraction of the rows would be updated. For instance, if a key-value data store such as HBase is used in a banking application and there is a failure during a money transfer between two accounts, the withdrawal from the origin account might complete while the deposit in the destination account is not performed. This scenario leads to an inconsistent database.

The above problems become more acute in a Polyglot Persistence environment. It is very typical to store structured data in an operational SQL database and the semi-structured part of the data in a NoSQL data store such as a key-value or document-oriented data store. However, both pieces of data are part of an integral conceptual business operation. A failure after one data store is updated and before a second data store is updated results in an inconsistent logical database (the set of data stores being used).

The CoherentPaaS FP7 European project [6] addresses the problems that companies face when they use Polyglot Persistence. This paper addresses the set of pains related to the lack of transactional ACID properties within and across data stores in these Polyglot environments. CoherentPaaS leverages the ultra-scalable transactional technology from LeanXcale [7] to bring scalable transactional processing to Polyglot Persistence environments and to NoSQL data stores.

The rest of the paper presents an overview of CoherentPaaS (Section II); transactions are then presented in Section ??. Finally, we present an overview of how transactions have been scaled and the modifications needed in the data stores in order to provide transactional semantics and transactions across different data stores (Section IV).

¹ http://hackingdistributed.com/2014/04/06/another-one-bites-the-dust-flexcoin/

[Figure 1: architecture diagram. Labels: PaaS Management (Network Monitoring, Elastic Cloud Manager, Deployer); Applications and Machine to Machine Applications; Cloud Common Query Language Engine; Transactions; NoSQL Cloud Data Stores (Document Cloud DS, GraphDB Cloud DS, Key-Value CDS); SQL-like Cloud Data Stores (Column-oriented CDS, In-memory DB, SQL scalable DB); Table-Oriented CEP (Complex Event Processing); LeanXcale Transaction Manager (Local Txn Mngs, Concurrency Controllers, Commit Sequencers, Loggers).]

Fig. 1. CoherentPaaS components

II. COHERENTPAAS

The goal of the CoherentPaaS project is to reduce the effort required to build cloud applications that use multiple cloud data management technologies, and to increase their quality, via a single query language, a uniform programming model, and ACID-based global transactional semantics. Figure 1 shows the components of the transactional management system, LeanXcale, the common query language, CloudMdsQL [10], used to access all the data stores using a common programming language, and the different types of data stores that are

constrains the concurrency of updates. The issue is that read and write operations of concurrent transactions conflict; that is, an arbitrary predicate can conflict with any concurrent update on the same table. When the predicate is not performed over indexed columns, the only way to guarantee serializability is to lock the whole table, which prevents updates to the table from running concurrently with the predicate reads.

Different isolation levels have been proposed. For instance, the ANSI isolation levels reduce the number of conflicts by reducing the level of isolation. Unfortunately, the ANSI isolation levels below serializability can exhibit serious anomalies in most applications, which result in data inconsistencies. Fortunately, in 1995 a new isolation level called Snapshot Isolation (SI) [?] was proposed. It is very close to serializability but avoids the conflicts between reads and writes. This removes, on the one hand, the bottleneck between predicate reads and updates and, on the other hand, the interference between long read queries and updates, which also highly constrains the potential concurrency across transactions.

Snapshot isolation provides any running transaction with a photograph, or snapshot, of the committed state of the database as of the start time of the transaction. In this way, it provides a high level of isolation since a running transaction does not observe any changes for concurr