Alfred Zimmermann, Alexander Rossmann (Eds.): Digital Enterprise Computing 2015, Lecture Notes in Informatics (LNI), Gesellschaft für Informatik, Bonn 2015

On Practical Implications of Trading ACID for CAP in the Big Data Transformation of Enterprise Applications

Andreas Tönne1

Abstract: Transactional enterprise applications today depend on the ACID property of the underlying database. ACID is a valuable model to reason about concurrency and consistency constraints in the application requirements. The new class of NoSQL databases that is used at the foundation of many big data architectures drops the ACID properties and especially gives up strong consistency for eventual consistency. The common justification for this change is the CAP theorem, although general scalability concerns are also noted. In this paper we summarize the arguments for dropping ACID in favour of a weaker consistency model for big data architectures. We then show the implications of the weaker consistency for the business customer and his requirements when a transactional Java EE application is transformed to a big data architecture.

Keywords: big data, CAP theorem, consistency, requirements

1 Introduction

The ACID property of relational databases has proven to be a valuable guarantee for specifying enterprise application requirements. Business customers are enabled to express their business needs as if they are sequentially executed. The complicated details of concurrency control of transactions and maintaining consistency are delegated to the database. ACID has become a synonym for a promise of effective and safe concurrency of business services. And business customers have learned to assume this as given and as a reasonable requirement.

The broad scale introduction of NoSQL [Ev09] [Fo12] databases to enterprises for solving big data business needs introduces a new class of promises to the business customer. Strong consistency as promised by ACID is replaced by the rather weak eventual consistency [Vo09]. The CAP theorem [Br00] is the excuse of the NoSQL developers and generally of the big data architects to weaken consistency in favour of availability, although modern NoSQL databases also offer higher levels of consistency.

In this paper, we discuss the consequences of the transformation of enterprise applications to a big data architecture. Our focus lies on the practical implications of loosening consistency for the business customers and their requirements. We consider the severity of the implications, the difficulty to balance these with the customer expectations and strategies for remedies.

1 NovaTec Consulting GmbH, Dieselstrasse 18/1, 70771 Leinfelden-Echterdingen, andreas.toenne@novatec-gmbh.de

1.1 Transformation Scope

The class of enterprise applications which we consider are the OLTP (online transaction processing) applications. As the name implies, these applications critically depend on transactional properties, and for this upholding ACID is seen as mandatory. Our general interest lies in the question to what degree (and at what cost) business needs that today require OLTP applications can be transformed to the big data paradigm.

1.2 Big Data Challenges

OLTP applications are under a strong big data pressure. The three main characteristics of big data, volume, variety and velocity, also apply to many of today's enterprise applications. The volume of data to handle increases through new relevant sources like for example the Internet of Things or the German government's Industrie 4.0 initiative [Bm15]. Also globalization in general increases the amount of data to handle. The variety of data increases greatly because structured data sources are increasingly accompanied or replaced by unstructured sources like the Web or Enterprise 2.0 initiatives. Acceleration of business processes and the success of big data analytics (data science) of competitors increase the demands for a higher velocity, especially combined with the required higher data volumes.

We think it is fair to state that many OLTP applications already operate under big data conditions. However they show a strong inertia to complete the transformation. Their architecture hinders the integration of unstructured data sources. Their choice of for example Java EE with a relational database means a hard limitation of scalability. And as will be shown in this paper, the practical consequences and compromises for big data scale processing are threatening to the business customer.

1.3 Contribution

We experienced these transformation challenges in the migration of a customer's middleware solution to a big data architecture. The first transformation milestone, establishing demonstration capabilities of the solution, was an eleven man-year effort. The architecture changed from a layered monolith based on Java EE and a relational database to a cloud-ready microservices topology. We were using the Titan graph database [Th15], Cassandra [Ap15] [LM10] as the NoSQL storage backend and Elasticsearch [El15] for indexing. The transformation was critical to our customer for the simple reason that the existing implementation was falling short of the scaling requirements by three orders of magnitude.

The solution's main purpose is to provide ETL (extract, transform, load) services to enterprise applications in a flexible way. Structured and unstructured data is extracted from a large range of sources (using agents) and losslessly converted to a common data model. This data is persisted in the style of a data lake [Fo15] but structured and enriched with metadata. The metadata describes semantic and formal relations between the data records similar to an ontology. The combined data structure is persisted in a distributed graph database.

The heart of the solution are the analysis algorithms and their quality requirements. Running such analysis concurrently while preserving consistency is a critical requirement. If for example two data records are analyzed for their relationships concurrently, duplicate relations like "my user matches your author" and "my author matches your user" would lead to wrong overall relation weights between the two records. Such a concurrency conflict is actually a common case: if sets of related records are modified in an application, these would be imported for analysis concurrently and the consistency of relations is threatened across many concurrent nodes. Consistency rules like requiring uniqueness of commutative relations ("we have something in common") were enforced in the business requirements through uniqueness constraints on the database. It should be needless to point out that this was a scalability killer by design.
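To illustrate the commutativity problem, the following minimal Java sketch (our illustration with hypothetical names, not the project's code) reduces a relation between two records to a canonical key, so that "my user matches your author" and "my author matches your user" collapse to the same identity. A database uniqueness constraint on such a key is exactly the construct that killed scalability.

```java
import java.util.Objects;

// Hypothetical sketch: canonical key for a commutative relation between two
// records. Ordering the pair makes (a, b) and (b, a) identical, so a single
// uniqueness constraint can reject the duplicate relation.
public final class RelationKey {
    private final String lowId;   // lexicographically smaller record id
    private final String highId;  // lexicographically larger record id
    private final String type;    // relation type, e.g. "user-author-match"

    public RelationKey(String recordA, String recordB, String type) {
        if (recordA.compareTo(recordB) <= 0) {
            this.lowId = recordA;
            this.highId = recordB;
        } else {
            this.lowId = recordB;
            this.highId = recordA;
        }
        this.type = type;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof RelationKey)) return false;
        RelationKey k = (RelationKey) o;
        return lowId.equals(k.lowId) && highId.equals(k.highId) && type.equals(k.type);
    }

    @Override
    public int hashCode() {
        return Objects.hash(lowId, highId, type);
    }
}
```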

1.4 Structure of this Paper

In the following section 2 we introduce the reader to the CAP theorem and eventual consistency and discuss their relevance for scalability. Section 3 is the main result of this paper, showing the implications of switching to eventual consistency for scalability reasons to our business project. Some remaining challenges are described in section 4.

2 The CAP Theorem, Eventual Consistency and Scalability

In this section, we provide an overview of the influences of the CAP theorem and eventual consistency on scalability concerns of distributed transactional applications. We motivate the decision to loosen consistency up to the point of eventual consistency. And we briefly show possibilities to achieve higher levels of consistency and their price in terms of scalability and effort.

2.1 Scalability is the New Top Priority

The transformation of application architectures from transactional monoliths to distributed cloud service topologies, from traditional to web and big data, usually comes with a shift of priorities. An influential statement for this shift was given by Werner Vogels, Amazon CTO:

"Each node in a system should be able to make decisions purely based on local state. If you need to do something under high load with failures occurring and you need to reach agreement, you're lost. If you're concerned about scalability, any algorithm that forces you to run agreement will eventually become your bottleneck. Take that as a given." [Vo07] [In07]

The shift was from consistency as the main promise for application properties to availability of services, which is treated second-class by ACID systems. And the ruling regime, as pointed out by Vogels, is the (horizontal) scalability. These new architectures are characterized by being extremely distributed, both in terms of number of computing and data nodes and distance measured in communication latency and network partition probability. Data replication is used to achieve better proximity of data and client and better availability in the presence of network partitions. Also for big data, replication is often the only sensible backup strategy.

2.2 BASE Semantics and Eventual Consistency

The conflict between ACID's consistency focus and the availability demands of web-based scalable large distributed systems led to the alternative BASE data semantics [Fo97] (basically available, soft state, eventual consistency) which uses the weak eventual consistency model. This basically means that after a write there is an unspecified period of time of inconsistency while the change is replicated through the distributed topology of data nodes. Which version of the data is presented at a node in this time window is pure luck. Eventually, after a period of no changes, the network of data nodes stabilizes and agrees on the last written version. We do not want to embark on a discussion of other stronger consistency models that are also offered by some NoSQL databases like Cassandra using quorums or Paxos protocols.

When discussing this consistency issue with business customers in the light of their requirements, the interesting question is "how on earth did they get away with such a lousy consistency model?".

2.3 CAP Theorem

The answer lies in the CAP theorem [Br00] and its proof [GL02], which was (ab-)used by the developers of NoSQL databases to ignore the matter of stronger consistency to achieve better scalability and availability, and to justify eventual consistency as a law of nature. Any distributed system may have the following three important properties, and Brewer's conjecture was that you can only have two of these at any time2:

1. Consistency, which means single-copy consistency (ACID consistency without the data invariants): all nodes have the same version of a value.

2 Actually the proof by Gilbert and Lynch makes use of an asynchronous model and thus introduces the assumption that there is no common notion of time in the distributed system. This seems to be consistent with the intention of Brewer but limits the scope of CAP in practical systems.

2. Availability, which can be characterized as the promise: as long as a node is up, it always returns answers (after an unbounded duration of time).

3. Partition Tolerance, which means it is ok to lose arbitrarily many messages between nodes.

This theorem and its adoption as a justification for eventual consistency sparked lasting discussions about its relevance and which choice of CA, CP or AP is the best. Brewer himself summarizes his view of the state of affairs of CAP [Br12], highlighting the very important interpretation of CAP that "First, because partitions are rare, there is little reason to forfeit C or A when the system is not partitioned. Second, the choice between C and A can occur many times within the same system at very fine granularity..." and "Because partitions are rare, CAP should allow perfect C and A most of the time, but when partitions are present or perceived, a strategy that detects partitions and explicitly accounts for them is in order."

CAP is all about managing the exceptional case of a network partition, which is a split-brain situation. The decision to take is called the partition decision by Brewer [Br12]:

" Cancel an operationafter a timeout anddecrease availabilityor

" Proceedwith the operationand risk inconsistency The matter of healingnetwork partitionsishandled by NoSQL implementations using forexample brute-forcestrategies like last-write-wins. Amore graceful solutionwould be to use vector clockstoidentify andhandle merge conflictsofpartitioned data change.

2.4 NoSQL With Higher Levels Of Consistency

The NoSQL community initially has settled for AP with eventual consistency, inspired by Amazon's Dynamo database [De07]. But we are seeing more and more additions to NoSQL databases that allow stronger consistency models, like for example Cassandra [Da15] allowing a quorum consistency level. Thus the choice of AP or CP is no longer fixed by our architecture, and in Cassandra for instance it is possible at the level of individual reads and writes.

There is however a price attached to higher levels of consistency and that is the degree of scalability we can achieve. As Vogels pointed out in the above quote, any need for agreement opposes scalability. Achieving strong consistency with a quorum or even naively requiring all data nodes in a system to acknowledge a write will likely reduce scalability of a big data scale system drastically.
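For illustration, the DataStax Java driver (the 2.x era API, roughly contemporary with this project) lets a client pick the consistency level per statement; keyspace, table and column names below are hypothetical.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class ConsistencyPerOperation {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("demo_keyspace"); // hypothetical keyspace

        // Fast, weakly consistent write: a single replica acknowledges.
        SimpleStatement write = new SimpleStatement(
                "INSERT INTO records (id, payload) VALUES (?, ?)", "r1", "data");
        write.setConsistencyLevel(ConsistencyLevel.ONE);
        session.execute(write);

        // Stronger read: a majority of replicas must agree,
        // trading latency and availability for consistency.
        SimpleStatement read = new SimpleStatement(
                "SELECT payload FROM records WHERE id = ?", "r1");
        read.setConsistencyLevel(ConsistencyLevel.QUORUM);
        session.execute(read);

        cluster.close();
    }
}
```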

In our example application, lowering the consistency level to single node writes resulted in orders of magnitude speedups (throughput)3 while exposing architectural issues that we will describe in the following section 3.

3 Performance figures are unfortunately not available and would not be meaningful since the system went through a rapid sequence of changes.

The good news is that eventual consistency is not as bad as it sounds. Recent analysis [BG13] shows that eventually consistent databases can achieve consistency on a frequent basis within tens or hundreds of milliseconds. What this does not tell us is what damages are done during this period and how to deal with them.

3 Business Implications

This section is the main result of the transformation of our OLTP application to a big data stack with a NoSQL database. We discuss the tension in the project between the business customer, requiring ACID properties that he was used to, and the requirement to achieve a substantial increase in scalability and throughput. The reduction in consistency leads to some difficult questions about the business requirements. We also give some technical details of how we dealt with the errors introduced by the reduced consistency level.

3.1 ACID Requirements

Judging the business implications of turning towards a big data database with loosened consistency requires us to respect the business customer needs and expectations. In an ideal world, a business service is specified under the assumption of perfect singularity of the executing system. No side effects, no concurrency, no temporal considerations, no other users, no failures. Consistency is defined in terms of business rules about data integrity. In the last decades, transactional systems upholding the ACID properties were understood and accepted by business customers as the nearest approximation of this simplified ideal that could be delivered by their IT people. The ACID properties of a transaction provide the following beneficial guarantees for its execution:

Atomicity - The guarantee that the transaction is executed in one atomic step (from an outside view) or aborted as a whole.

Consistency - The guarantee that the transaction transforms the database from a consistent state to another consistent state. The notion of consistency can vary largely, depending on the capabilities of the underlying database. Usually this includes structural

consistency rules about uniqueness and referential integrity as well as data consistency rules.

Isolation - The guarantee that concurrent transactions are isolated against their respective database changes to a configurable degree.

Durability - The guarantee that once a transaction was committed (success), its changes are permanent, even in the presence of system failures.

ACID has proven to be a simple enough model to talk about requirements in the presence of such nasty things as concurrency and failures. At times, the architect would come back with "Would it be allowed that..." questions, asking for exceptions to the simplified world and complicating things a bit. The I (Isolation) in ACID is the adjusting screw for such optimizations of SQL systems, possibly affecting their consistency. Locking and database constraints on uniqueness are commonly used in requirements to hedge against consistency errors and to control concurrency.
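As a concrete example of that last point, here is a minimal JPA sketch (illustrative names, not the project's schema, reusing the canonical ordering idea from the sketch in section 1.3) of how such a uniqueness requirement is typically hedged in a Java EE application: the database rejects the duplicate inside the transaction, so the business rule holds without application-level coordination.

```java
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import javax.persistence.Table;
import javax.persistence.UniqueConstraint;

// A relation record whose uniqueness is enforced by the database itself.
// Committing a duplicate (low_id, high_id, rel_type) makes the transaction
// fail, so the commutative-relation rule holds without extra coordination.
@Entity
@Table(name = "record_relation",
       uniqueConstraints = @UniqueConstraint(columnNames = {"low_id", "high_id", "rel_type"}))
public class RecordRelation {

    @Id
    @GeneratedValue
    private Long id;

    @Column(name = "low_id", nullable = false)
    private String lowId;   // canonical smaller record id

    @Column(name = "high_id", nullable = false)
    private String highId;  // canonical larger record id

    @Column(name = "rel_type", nullable = false)
    private String relType;

    protected RecordRelation() { } // required by JPA

    public RecordRelation(String lowId, String highId, String relType) {
        this.lowId = lowId;
        this.highId = highId;
        this.relType = relType;
    }
}
```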

3.2 Dropping ACID to Achieve Higher Scalability

It is common wisdom that big data is a disruptive concept that requires the transformation of the business's core for adoption. Business requirements for achieving business goals need to be redefined to reflect the paradigms of big data architectures. That means: what promises for consistency, availability and scalability can your IT people deliver on a big data scale? Taking a technology approach by replacing relational databases by NoSQL and calling it good is no option. Unfortunately, as we experienced in the mentioned project, this is an all too alluring idea to waive it easily. The business customer must be educated in the consequences of asking to "solve the problem with big data". And this means to find the sweet spot in the triangle of loosening consistency, achievable scalability and implementation effort.

In our example, we opted for loosening consistency as much as possible because that was the quick route4 to deliver a system with the necessary performance. Going for eventual consistency means to drop ACID in favour of a BASE semantics of transactions. That in turn exposed all the unknown invariants of the analysis algorithms. Dropping ACID also created an enormous technical debt to handle the consistency errors.

4 We had to deliver the demo-ready version within six months.

3.3 Consequences of Lowering Consistency

We mentioned the uniqueness constraint on commutative relations between records. The analysis requirements disallowed dual relations of the kind "My user is your author" and "My author is your user" since they are considered commutative and duplicating them puts a too strong emphasis on the relationship of the two records. Trading ACID for an unconstrained database created questions for this example:

" Howoften does this constraintviolationtake place in practice?

" Howsevere is theerror introduced? - Follow-up question: Has anyone a measure formeta annotationerrors?

" Do otheralgorithms usingthe meta annotationbreak on the duplicationitself or on the annotationweight error? - Follow-up question: Whodepends on this annotationanyway? External clientsoralso internal algorithms?

" Canwe ignore the errorfor some time or is there a build-upeffect over time?

" Do we need to filter the duplicate annotations at read time or canwe repair the consistency at intervals? - Follow-up question:What wouldbe a reasonable interval? These questionsled to rather uncomfortable andtime-consumingmeetingswiththe business customer.The customer didnot have aneedtothink about these issues before; he wasshielded from them by a single requirementofuniquenessofcommutative relations. As Brewer [Br12] noted: "Anotheraspect of CAPconfusion is the hiddencost of forfeiting consistency,which is theneedtoknowthe system’s invariants.The subtle beauty of a consistent system is that theinvariantstendtoholdevenwhenthe designer does not knowwhat they are. Consequently,awide range of reasonable invariants will work just fine." We observedafewdistinct typesofissueswithweakening consistency anddropping constraints that have distinct reasonsand justifications.The following sections show a fewcommoncases.

3.4 Internal Phantom Reads

In our example, this was the type of effect with the most quality damages. This is the premier effect of eventual consistency. An internal phantom read happens when one concurrent service execution runs in isolation (due to the eventual consistency time window) to another execution and sees old versions of data modified by the second execution. Example: two related records are analyzed in parallel and due to bad luck of replication one analysis does not see the other's updated record. This is severe since it means that the analysis results become random for records imported in a tight time window. Also consider that there is a higher probability for a meaningful relation between two records received at the same time. We might lose valuable analysis results. In a transactional system, this scenario calls for a serialization of execution with locking.

A solution was to keep a journal of analysis order and determine those imports that might need a re-analysis to stabilize the annotation consistency.
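A minimal sketch of the journal idea (hypothetical names and a fixed conflict window; the real system used more involved heuristics): every analysis is logged with a timestamp, and records analyzed within the assumed replication window of each other become candidates for re-analysis.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical journal of analysis executions. Entries that fall within a
// short window of each other may have missed each other's writes under
// eventual consistency and are therefore candidates for re-analysis.
public class AnalysisJournal {

    private static final long CONFLICT_WINDOW_MILLIS = 500; // assumed replication window

    public static final class Entry {
        final String recordId;
        final long analyzedAtMillis;

        Entry(String recordId, long analyzedAtMillis) {
            this.recordId = recordId;
            this.analyzedAtMillis = analyzedAtMillis;
        }
    }

    private final ConcurrentLinkedQueue<Entry> journal = new ConcurrentLinkedQueue<>();

    public Entry logAnalysis(String recordId) {
        Entry e = new Entry(recordId, System.currentTimeMillis());
        journal.add(e);
        return e;
    }

    /** Records analyzed within the conflict window of the given entry. */
    public List<String> reanalysisCandidates(Entry latest) {
        List<String> candidates = new ArrayList<>();
        for (Entry e : journal) {
            if (!e.recordId.equals(latest.recordId)
                    && Math.abs(e.analyzedAtMillis - latest.analyzedAtMillis) <= CONFLICT_WINDOW_MILLIS) {
                candidates.add(e.recordId);
            }
        }
        return candidates;
    }
}
```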

3.5 Where is my Data?

Eventual consistency does not only affect concurrent execution but sometimes also serial execution. This depends on the implementation of the replication strategy of the database but also on the architecture of the system itself. Some databases do not give a guarantee that data written can be immediately read back, even at the same node, because the replication preparation takes some time. We experienced such problems when chaining analysis algorithms. A similar error (but not consistency caused) can take place with the Elasticsearch refresh interval. Here we did not find our own analysis results in the index for a short time until Elasticsearch had prepared and executed the index refresh.

We only found a brute-force solution to assure that chains of algorithms can work on their own results, apart from passing the results between the algorithm steps. We needed to use delays, either explicitly between each step or as retries with an exponential backoff.
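The retry variant reduces to a few lines. The following sketch is our illustration; the number of attempts and the base delay are assumptions, not the project's tuning.

```java
import java.util.concurrent.Callable;

// Retry a read until the expected result is visible, doubling the delay each
// time. This bridges the gap until replication or the index refresh catches up.
public final class Backoff {

    public static <T> T retryWithBackoff(Callable<T> read, int maxAttempts,
                                         long initialDelayMillis) throws Exception {
        long delay = initialDelayMillis;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            T result = read.call();
            if (result != null) {
                return result; // the write has become visible
            }
            if (attempt < maxAttempts) {
                Thread.sleep(delay);
                delay *= 2; // exponential backoff
            }
        }
        throw new IllegalStateException("result not visible after " + maxAttempts + " attempts");
    }
}
```

A call such as retryWithBackoff(() -> index.lookup(id), 5, 50) (with a hypothetical index accessor) then polls for up to five attempts with delays of 50, 100, 200 and 400 milliseconds.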

3.6 Zombie Data

Combining updates and deletes in a concurrent system gives you the usual zombie problem if updates and deletes of the same record overrun each other. We employed measures like timestamps to ensure a monotonic order of analysis executions for a record. And we wrote tombstones for deleted records to distinguish true data from zombies.

Tombstoning became an issue because we could not use locking for record updates. Acquiring locks in a distributed database like Cassandra is a very expensive operation that requires a consensus between all accessible nodes. Since updates were a very frequent operation, this was not acceptable. So there is a very small chance that a tombstone is overwritten, creating a zombie record. As usual with big data, small chances become significant given enough data updates. This problem awaits a solution still.
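A minimal sketch of the timestamp-plus-tombstone idea (a single-JVM illustration without the distributed machinery): updates apply only if they are newer than the stored version, and a delete writes a tombstone instead of removing the entry. Within one JVM the merge below is atomic per key; across database replicas no such atomicity exists, which is precisely the small remaining zombie chance described above.

```java
import java.util.concurrent.ConcurrentHashMap;

// Last-writer-wins record store with tombstones. A delete is just a newer
// version whose value is null, so late updates can be recognized and dropped.
public class TombstoneStore {

    static final class Versioned {
        final String value;   // null marks a tombstone (deleted record)
        final long timestamp; // monotonic analysis/update time

        Versioned(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    private final ConcurrentHashMap<String, Versioned> store = new ConcurrentHashMap<>();

    public void update(String id, String value, long timestamp) {
        // Keep whichever version carries the newer timestamp.
        store.merge(id, new Versioned(value, timestamp),
                (old, neu) -> neu.timestamp > old.timestamp ? neu : old);
    }

    public void delete(String id, long timestamp) {
        update(id, null, timestamp); // write a tombstone
    }

    public String read(String id) {
        Versioned v = store.get(id);
        return (v == null || v.value == null) ? null : v.value;
    }
}
```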

3.7 Who is Unique?

Assuring uniqueness has the same costs as a distributed lock for record updates. The key strategy to avoid this cost was to detach the creation of unique data from the remaining analysis algorithms. Some of the unique records were pre-created in a startup phase from sample data to reduce frequent locks. This was particularly successful for sets of unique data with an upper bound, for example the words of a particular language.
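A sketch of the pre-creation strategy (hypothetical names; the actual creation call into the database is abstracted away): the bounded set of unique records is created once at startup, so the hot analysis path performs only lock-free lookups and pays the expensive coordinated creation only for true misses.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Pre-created set of unique records (e.g. the words of a language). The hot
// path is a lock-free lookup; only unseen keys pay the expensive, coordinated
// unique creation in the distributed database.
public class UniqueRecordCache {

    private final ConcurrentHashMap<String, String> idsByKey = new ConcurrentHashMap<>();
    private final Function<String, String> expensiveUniqueCreate; // hits the database

    public UniqueRecordCache(Iterable<String> sampleKeys,
                             Function<String, String> expensiveUniqueCreate) {
        this.expensiveUniqueCreate = expensiveUniqueCreate;
        // Startup phase: create the bounded set of unique records up front.
        for (String key : sampleKeys) {
            idsByKey.put(key, expensiveUniqueCreate.apply(key));
        }
    }

    public String idFor(String key) {
        // Atomic per key within this JVM; the distributed uniqueness cost is
        // only paid for keys missing from the sample data.
        return idsByKey.computeIfAbsent(key, expensiveUniqueCreate);
    }
}
```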

3.8 Probabilistic Truth

This is an issue that originates from the distributed nature of the database itself and from the big data scale. Keeping a correct tally of data can be too expensive at real-time and one has to use approximation techniques. Consider for example the number of distinct words in the whole database. Keeping such a tally does require some sort of synchronization between the concurrent updates. We experimented with Cassandra counter columns [Da14] and with elastic caching approaches using Hazelcast [Ha15] but finally decided to use a probabilistic approach with a defined error rate, based on a Google implementation [HNH13] of HyperLogLog [Fl07].
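To show the flavour of the approach, here is a compact textbook HyperLogLog [Fl07] sketch in plain Java (our illustration; the HyperLogLog++ variant of [HNH13] adds bias correction and a sparse representation on top of this).

```java
// Minimal textbook HyperLogLog cardinality estimator [Fl07]. Production use
// should prefer a hardened implementation such as HyperLogLog++ [HNH13].
public class HyperLogLog {

    private final int b;        // number of index bits, m = 2^b registers
    private final int m;
    private final byte[] registers;

    public HyperLogLog(int b) {
        this.b = b;
        this.m = 1 << b;
        this.registers = new byte[m];
    }

    public void offer(Object item) {
        long x = mix64(item.hashCode());                // spread the weak hashCode
        int idx = (int) (x >>> (64 - b));               // first b bits pick a register
        long rest = x << b;                             // remaining bits
        int rank = Long.numberOfLeadingZeros(rest) + 1; // position of first 1-bit
        if (rank > registers[idx]) {
            registers[idx] = (byte) rank;
        }
    }

    public long cardinality() {
        double sum = 0.0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.pow(2.0, -r);
            if (r == 0) zeros++;
        }
        double alpha = 0.7213 / (1.0 + 1.079 / m);      // bias correction for m >= 128
        double estimate = alpha * m * m / sum;
        if (estimate <= 2.5 * m && zeros > 0) {          // small-range correction
            estimate = m * Math.log((double) m / zeros);
        }
        return Math.round(estimate);
    }

    private static long mix64(long z) {                  // splitmix64 finalizer
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return z ^ (z >>> 31);
    }
}
```

With b = 14 (16384 registers, 16 KB of state) the standard error is about 1.04/sqrt(m), roughly 0.8 percent; a defined error rate of this kind is what the probabilistic approach trades for synchronization-free counting.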

4 Outlook

Our experience with the transformation of our OLTP Java EE / relational application to a native big data stack shows the importance of a hybrid persistency approach. As advocates of ACID for NoSQL [Pi15] highlight, there is a need for the option to choose stronger consistency at high performance. In a hybrid approach, we could separate data needing linearizability consistency from data tolerating weaker consistency. For example, dumping the mass of data in HBase and keeping the strongly consistent references to that data elsewhere. Google Spanner [Cr13] shows that by having an accurate and coherent distributed clock, one can achieve strong consistency at a reasonable cost.

Another idea we want to pursue is to avoid certain types of consistency errors using CRDTs (commutative replicated data types) [Sh11] to propagate conflict-free distributed changes in the case of network partitions. The case of internal phantom reads can be considered to be caused by a short-time network partition. We would be happy to resolve this partition conflict-free. The riak database [Ba15] shows how to combine NoSQL concepts with CRDTs [Ba15a] to achieve conflict-free data types.
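For a taste of the CRDT idea, here is a minimal grow-only counter (G-Counter) sketch following [Sh11]: each replica increments only its own slot and merging is an entry-wise maximum, so all replicas converge regardless of message ordering or duplication, with no agreement protocol required.

```java
import java.util.HashMap;
import java.util.Map;

// Grow-only counter CRDT (G-Counter). Replicas converge under any delivery
// order because merge is an entry-wise maximum (commutative and idempotent).
public final class GCounter {

    private final String nodeId;
    private final Map<String, Long> counts = new HashMap<>();

    public GCounter(String nodeId) {
        this.nodeId = nodeId;
    }

    public void increment() {
        counts.merge(nodeId, 1L, Long::sum); // a node only touches its own slot
    }

    public long value() {
        long sum = 0;
        for (long c : counts.values()) sum += c;
        return sum;
    }

    /** Merge a replica's state: per-node maximum, safe to apply repeatedly. */
    public void merge(GCounter other) {
        for (Map.Entry<String, Long> e : other.counts.entrySet()) {
            counts.merge(e.getKey(), e.getValue(), Math::max);
        }
    }
}
```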

5 Conclusions

Transforming existing business requirements that are implemented by OLTP applications to a big data solution using today's best practice big data technology means to give up many benefits of ACID transactions and especially strong consistency in general. We showed our practical experiences with a sizable (and from the customer's perspective very successful) transformation of an existing OLTP application to a big data stack. This example shows that although the weakening of consistency to the level of eventual consistency does not prohibit this transformation, it still creates a significant debt. We gave some examples of the errors introduced by the lower consistency and how we dealt with them technically.

The cost of compensating for consistency defects and the achievable scalability had to be compared and judged together with the business customer. The effort for the transformation of the business requirements, and thus the effort put on the business customer and architect, turned out to be significant and it needed to be carefully accompanied by consulting and mentoring. Invariants of business requirements that are usually hidden by the ACID properties and database locks and constraints become important when high scalability needs to be achieved with a NoSQL database.

References

[Ap15] Apache: Welcome to Apache Cassandra, http://cassandra.apache.org, April 9, 2015.

[Ba15] Basho: Riak product site, http://basho.com/riak/, April 9, 2015.

[Ba15a] Basho: How to Build a Client Library for Riak 2.0, http://basho.com/tag/crdt/, April 9, 2015.

[BG13] Bailis, P.; Ghodsi, A.: Eventual Consistency Today: Limitations, Extensions, and Beyond. acmqueue, volume 11, issue 3, 2013.

[Bm15] Bundesministerium für Bildung und Forschung: Zukunftsprojekt Industrie 4.0, http://www.bmbf.de/de/9072.php, April 9, 2015.

[Br00] Brewer, E.: Towards Robust Distributed Systems. In: Proceedings of the 19th Annual ACM Symposium on Principles of Distributed Computing (PODC '00), pages 7-10, 2000.

[Br12] Brewer, E.: CAP Twelve Years Later: How the 'Rules' Have Changed. Computer, pages 23-29, 2012.

[Cr13] Corbett, J. et al.: Spanner: Google's Globally-Distributed Database. In: ACM Transactions on Computer Systems (TOCS), volume 31, issue 3, 2013.

[Da14] DataStax: What's New in Cassandra 2.1: A Better Implementation of Counters. http://www.datastax.com/dev/blog/whats-new-in-cassandra-2-1-a-better-implementation-of-counters, May 20, 2014.

[Da15] DataStax: Configuring Data Consistency. http://docs.datastax.com/en/cassandra/2.0/cassandra/dml/dml_config_consistency_c.html, April 9, 2015.

[De07] DeCandia, G. et al.: Dynamo: Amazon's Highly Available Key-Value Store. In: SOSP '07: Proceedings of the Twenty-First ACM SIGOPS Symposium on Operating Systems Principles, pages 205-220, 2007.

[El15] Elasticsearch BV: Elasticsearch | Search & Analyze Data in Real Time, https://www.elastic.co/products/elasticsearch, April 9, 2015.

[Ev09] Evans, E.: Eric Evan's Weblog, http://blog.sym-link.com/2009/05/12/nosql_2009.html, April 9, 2015.

[Fl07] Flajolet, P. et al.: HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. In: AOFA '07: Proceedings of the 2007 International Conference on the Analysis of Algorithms, 2007.

[Fo97] Fox, A.; Gribble, S.; Chawathe, Y.; Brewer, E.; Gauthier, P.: Cluster-Based Scalable Network Services. In: Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles (SOSP-16), 1997.

[Fo12] Fowler, M.: NosqlDefinition, http://martinfowler.com/bliki/NosqlDefinition.html, April 9, 2015.

[Fo15] Fowler, M.: DataLake, http://martinfowler.com/bliki/DataLake.html, April 9, 2015.

[GL02] Gilbert, S.; Lynch, N.: Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services. In: ACM SIGACT News 33, pages 51-59, 2002.

[Ha15] Hazelcast: Company site, http://hazelcast.com, April 9, 2015.

[HNH13] Heule, S.; Nunkesser, M.; Hall, A.: HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm. In: EDBT '13: Proceedings of the 16th International Conference on Extending Database Technology, 2013.

[In07] InfoQ: Presentation: Amazon CTO Werner Vogels on Availability and Consistency, http://www.infoq.com/news/2007/08/werner-vogels-pres, April 9, 2015.

[LM10] Lakshman, A.; Malik, P.: Cassandra: a decentralized structured storage system. In: ACM SIGOPS Operating Systems Review, Volume 44, Issue 2, pages 35-40, 2010.

[Pi15] Pimentel, S.: The Return of ACID in the Design of NoSQL Databases, http://www.methodsandtools.com/archive/acidnosqldatabase.php, April 9, 2015.

[Sh11] Shapiro, M. et al.: A comprehensive study of convergent and commutative replicated data types. INRIA Technical Report RR-7506, 2011.

[Th15] Thinkaurelius: Titan Distributed Graph Database, http://thinkaurelius.github.io/titan/, April 9, 2015.

[Vo07] Vogels, W.: Availability & Consistency. QCon 2007.

[Vo09] Vogels, W.: Eventually Consistent. Communications of the ACM 52, 2009.