Distribution, Data, Deployment: Software Architecture Convergence in Big Data Systems
FEATURE: BIG DATA SOFTWARE SYSTEMS

Ian Gorton and John Klein, Software Engineering Institute

// Big data systems present many challenges to software architects. In particular, distributed software architectures become tightly coupled to data and deployment architectures. This causes a consolidation of concerns; designs must be harmonized across these three architectures to satisfy quality requirements. //

THE EXPONENTIAL GROWTH of data over the last decade has fueled a new specialization for software technology: data-intensive, or big data, software systems.1 Internet-born organizations such as Google and Amazon are on this revolution's cutting edge, collecting, managing, storing, and analyzing some of the largest data repositories ever constructed. Their pioneering efforts,2,3 along with those of numerous other big data innovators, have provided a variety of open source and commercial data management technologies that let any organization construct and operate massively scalable, highly available data repositories.

Addressing the challenges of software for big data systems requires careful design tradeoffs spanning the distributed software, data, and deployment architectures. It also requires extending traditional software architecture design knowledge to account for the tight coupling that exists in scalable systems. Scale drives a consolidation of concerns, so that distribution, data, and deployment architectural qualities can no longer be effectively considered separately. To illustrate this, we'll use an example from our current research in healthcare informatics.

The Challenges of Big Data

Data-intensive systems have long been built on SQL database technology, which relies primarily on vertical scaling—faster processors and bigger disks—as workload or storage requirements increase. SQL databases' inherent vertical-scaling limitations4 have led to new products that relax many core tenets of relational databases. Strictly defined normalized data models, strong data consistency guarantees, and the SQL standard have been replaced by schemaless and intentionally denormalized data models, weak consistency, and proprietary APIs that expose the underlying data management mechanisms to the programmer. These NoSQL products4 typically are designed to scale horizontally across clusters of low-cost, moderate-performance servers. They achieve high performance, elastic storage capacity, and availability by partitioning and replicating datasets across a cluster. Prominent examples of NoSQL databases include Cassandra, Riak, and MongoDB (see the sidebar "NoSQL Databases").
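The mechanics of partitioning and replication are easy to sketch. The following minimal Python sketch assumes a four-node cluster and a replication factor of three (both invented for illustration): it hashes each record key to a primary shard and copies the write to the next two nodes. Real NoSQL products refine this idea with consistent hashing, virtual nodes, and failure handling.

import hashlib

NODES = ["node0", "node1", "node2", "node3"]  # hypothetical cluster
REPLICATION_FACTOR = 3                        # each record stored on 3 nodes

# In-memory stand-in for per-node storage.
storage = {node: {} for node in NODES}

def shard_for(key):
    """Hash the record key to pick a primary shard."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % len(NODES)

def put(key, value):
    """Write to the primary shard, then replicate to its successors."""
    primary = shard_for(key)
    for i in range(REPLICATION_FACTOR):
        node = NODES[(primary + i) % len(NODES)]
        storage[node][key] = value

put("patient:42", {"name": "Ian", "employer": "SEI"})
print([n for n in NODES if "patient:42" in storage[n]])  # 3 of the 4 nodes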
NOSQL DATABASES

The rise of big data applications has caused a significant flux in database technologies. While mature relational database technologies continue to evolve, a spectrum of databases called NoSQL has emerged over the past decade. The relational model imposes a strict schema, which inhibits data evolution and causes difficulties scaling across clusters. In response, NoSQL databases have adopted simpler data models. Common features include schemaless records, allowing data models to evolve dynamically, and horizontal scaling, by sharding (partitioning and distributing) and replicating data collections across large clusters. Figure A illustrates the four most prominent data models, whose characteristics we summarize here. More comprehensive information is at http://nosql-database.org.

Document databases (see Figure A1) store collections of objects, typically encoded using JSON (JavaScript Object Notation) or XML. Documents have keys, and you can build secondary indexes on nonkey fields. Document formats are self-describing; a collection might include documents with different formats. Leading examples are MongoDB (www.mongodb.org) and CouchDB (http://couchdb.apache.org).

Key–value databases (see Figure A2) implement a distributed hash map. Records can be accessed primarily through key searches, and the value associated with each key is treated as opaque, requiring reader interpretation. This simple model facilitates sharding and replication to create highly scalable and available systems. Examples are Riak (http://riak.basho.com) and DynamoDB (http://aws.amazon.com/dynamodb).

Column-oriented databases (see Figure A3) extend the key–value model by organizing keyed records as a collection of columns, where a column is a key–value pair. The key becomes the column name; the value can be an arbitrary data type such as a JSON document or binary image. A collection can contain records with different numbers of columns. Examples are HBase (http://hbase.apache.org) and Cassandra (https://cassandra.apache.org).

Graph databases (see Figure A4) organize data in a highly connected structure—typically, some form of directed graph. They can provide exceptional performance for problems involving graph traversals and subgraph matching. Because efficient graph partitioning is an NP-hard problem, these databases tend to be less concerned with horizontal scaling and commonly offer ACID (atomicity, consistency, isolation, durability) transactions to provide strong consistency. Examples include Neo4j (www.neo4j.org) and GraphBase (http://graphbase.net).

NoSQL technologies have many implications for application design. Because there's no equivalent of SQL, each technology supports its own query mechanism. These mechanisms typically make the application programmer responsible for explicitly formulating query executions, rather than relying on query planners that execute queries based on declarative specifications. The programmer is also responsible for combining results from different data collections. This lack of the ability to perform JOINs forces extensive denormalization of data models so that JOIN-style queries can be efficiently executed by accessing a single data collection. When databases are sharded and replicated, the programmer also must manage consistency when concurrent updates occur and must design applications to tolerate stale data due to latency in update replication.

FIGURE A. Four major NoSQL data models. (1) A document store. (2) A key–value store. (3) A column store. (4) A graph store.
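To ground the sidebar's document model in code, here is a minimal sketch using pymongo, MongoDB's Python driver. It assumes a MongoDB server running on localhost, and the database, collection, and field names are invented for the example (the records mirror Figure A1). Note that the two documents need not share a format, and the secondary index is built on a nonkey field.

from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")  # assumes a local server
employees = client["example_db"]["employees"]

# Schemaless: documents in one collection may have different fields.
employees.insert_one({"_id": "1", "Name": "John", "Employer": "SEI"})
employees.insert_one({"_id": "2", "Name": "Ian", "Employer": "SEI",
                      "Previous": "PNNL"})

# Secondary index on a nonkey field.
employees.create_index([("Employer", ASCENDING)])

for doc in employees.find({"Employer": "SEI"}):
    print(doc)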
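The sidebar's point about missing JOINs is also easy to see in code. The following plain-Python sketch (illustrative data only, no real database) contrasts an application-side join across two collections with a denormalized layout in which a single read answers the same query.

# Two normalized "collections" (plain dicts standing in for a NoSQL store).
employees = {"1": {"Name": "John", "employer_id": "e1"}}
employers = {"e1": {"Name": "SEI", "City": "Pittsburgh"}}

# Without JOINs, the application combines results itself: two lookups,
# and in a distributed store, two network round trips.
emp = employees["1"]
joined = {**emp, "Employer": employers[emp["employer_id"]]}

# Denormalized alternative: embed the employer so one read suffices,
# at the cost of duplicating employer fields across employee records.
employees_denormalized = {
    "1": {"Name": "John",
          "Employer": {"Name": "SEI", "City": "Pittsburgh"}},
}

print(joined)
print(employees_denormalized["1"])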
Distributed databases have fundamental quality constraints, defined by Eric Brewer's CAP (consistency, availability, partition tolerance) theorem.5 When a network partition occurs (causing arbitrary message loss between cluster nodes), a system must trade consistency (all readers see the same data) against availability (every request receives a success or failure response). Daniel Abadi's PACELC provides a practical interpretation of this theorem.6 If a partition (P) occurs, a system must trade availability (A) against consistency (C). Else (E), in the usual case of no partition, a system must trade latency (L) against consistency (C). The quorum sketch below illustrates this latency-consistency tradeoff.

Additional design challenges for scalable data-intensive systems stem from the following three issues. First, achieving high scalability and …

… architects must diligently evaluate candidate database technologies and select databases that can satisfy application requirements. This often leads to polyglot persistence—using different database technologies to store different datasets in a single system, to meet quality attribute requirements.8 A routing sketch below shows the shape this can take in code.

Furthermore, as data volumes grow to petascale and beyond, the required hardware resources grow from hundreds to tens of thousands of servers. At this deployment scale, many widely used software architecture patterns are unsuitable. Architectural and algorithmic approaches that are sensitive to hardware resource use can significantly reduce overall costs. For more on this, see the sidebar "Why Scale Matters."

Big Data Application

… Analysis of petabytes of data across patient populations, taken from diverse sources such as insurance payers, public health entities, and clinical studies, can reduce costs by improving patient outcomes. In addition, such analyses can yield operational efficiencies and new insights for disease treatment and prevention.

Across these and many other domains, big data systems share four requirements that drive the design of suitable software solutions. Collectively, these requirements represent a significant departure from traditional business systems, which are relatively well constrained in terms of data growth, analytics, and scale.

First, from social media sites to high-resolution sensor data collection in the power grid, big data systems must be able to sustain write-heavy workloads.1 Because writes are costlier than reads, systems can use data sharding (partitioning and distribution) to spread write operations across disks and can use replication to provide high availability. Sharding and replication introduce …
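The latency-consistency tradeoff that PACELC describes can be illustrated with a small simulation. The sketch below is our own illustration rather than any product's API: with N replicas, a write acknowledged by W replicas and a read that consults R replicas will always observe the latest version when R + W > N; shrinking W or R means waiting on fewer replicas (lower latency) at the risk of stale reads.

import random

N = 3                              # replicas per record
replicas = [{} for _ in range(N)]  # in-memory stand-ins for replica nodes

def write(key, value, version, w):
    """Acknowledge once w replicas accept the write; the rest lag behind."""
    for rep in random.sample(replicas, w):
        rep[key] = (version, value)

def read(key, r):
    """Query r replicas and return the freshest version seen."""
    answers = [rep.get(key) for rep in random.sample(replicas, r)]
    return max((a for a in answers if a is not None), default=None)

write("k", "v1", version=1, w=N)  # fully replicated baseline
write("k", "v2", version=2, w=1)  # fast write: only one replica acked
print(read("k", r=1))  # may return the stale version 1
print(read("k", r=N))  # R + W > N: guaranteed to see version 2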
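Polyglot persistence often reduces, in application code, to a thin routing layer that directs each dataset to the store best suited to its access patterns. The sketch below shows one hypothetical shape for such a layer; the store classes are stubs standing in for real drivers, and the dataset names are invented.

class KeyValueStore:   # stub for, say, a Riak or DynamoDB client
    def put(self, k, v): print(f"kv-store put {k}")

class DocumentStore:   # stub for, say, a MongoDB client
    def put(self, k, v): print(f"doc-store put {k}")

class GraphStore:      # stub for, say, a Neo4j client
    def put(self, k, v): print(f"graph-store put {k}")

# One system, several databases: route each dataset by its access pattern.
ROUTES = {
    "session":  KeyValueStore(),  # simple key lookups, heavy write volume
    "patient":  DocumentStore(),  # semistructured records, secondary indexes
    "referral": GraphStore(),     # relationship traversals
}

def save(dataset, key, value):
    ROUTES[dataset].put(key, value)

save("session", "s1", {"user": "ian"})
save("patient", "p42", {"Name": "John", "Employer": "SEI"})
save("referral", "r7", ("p42", "refers_to", "p99"))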