Scalable Storage: The Drive for Web-Scale Data Management
Bryan Rosander
University of Central Florida
[email protected]
March 28, 2012

Abstract

Data-intensive applications have become prevalent in today's information economy. The sheer amount of data stored and utilized by today's web services presents unique challenges in the areas of scalability, security, and availability. At the same time, this abundance of data has opened new possibilities in data mining, allowing for more tightly integrated, informative services, but it has also created new challenges. Traditional, monolithic, relational databases are inherently limited in terms of scalability, which has caused many leading companies to abandon traditional databases in favor of horizontally scalable data stores. This paper will evaluate the state of the art in data storage and retrieval, covering the history of the database and moving on to newer database technologies such as Google's Bigtable, Apache Cassandra, and Amazon's DynamoDB.

1 Introduction

Data storage and retrieval has become a central part of many popular web applications. As the amount of data available increases, database capacity must scale up to meet it. Traditional methods of scaling up database capacity focus mainly on increasing the computing power of the single server on which the database resides. This strategy has been sufficient for many applications but has become infeasible for those that need to store more data than can be efficiently processed by one machine.

Newer database paradigms emphasizing horizontal scalability, the ability to add as many nodes as are necessary and redistribute the data between all active nodes, have been growing in popularity. This increase in scalability does come at a cost. Many features that developers take for granted aren't feasible on a horizontally scalable platform. For example, Google's Bigtable doesn't support many traditional querying operations (e.g. joins), which means that substantial changes must be made to an application switching from a traditional relational database. Another problem is that SQL as a standard meant that most databases behaved in more or less the same way. Newer technologies eschewing SQL have been categorized as NoSQL (sometimes expanded to "Not Only SQL"), and there is no set standard for what they support. This puts a lot of pressure on the developer to make the right decision, as switching from one NoSQL platform to another requires much more effort than a transition from one SQL database to another. This paper aims to provide enough information to make an intelligent decision about which technology is right for a given application, as well as what tradeoffs between scalability and ease of use have been made.

2 History of DBMS

The data base management system (DBMS) specifications were published in the CODASYL Data Base Task Group's 1971 report [40]. The first DBMS systems relied on tree-structured files and network models of data. These systems required applications to depend on the underlying structures, resulting in fragile applications that depended on artifacts of how data was stored rather than on what that data was. This data dependence manifested itself in applications' reliance on the existence of indexes (which were specified by name in application code) and on the order in which collections were persisted to disk. The desire for data independence gave rise to the idea of relational databases.
The goal of the relational database was to increase the proportion of data representation characteristics that could "be changed without logically impairing some application programs." Relational databases made data normalization feasible. Normalization is the decomposition of all nonsimple domains into multiple simple domains, which has several advantages including deduplication of data, easier consistency checking, and aggregation [28].

Before the relational database, procedural data manipulation languages were used to retrieve data, meaning that the user had to manually navigate the data structures in order to retrieve the desired data. Relational databases opened up the possibility of declarative data manipulation languages, which allow the user to specify the results they are interested in and let the DBMS translate the declarative query into the procedure for retrieving the data. The development of SQL, which is based on relational calculus, led to a de facto standardization of the database industry [37].

Modern relational databases provide many features that facilitate processing data while maintaining consistency. These consistency constraints can be summed up as "atomicity, consistency, isolation, and durability (ACID)" [33]. ACID properties make it very easy to develop applications that won't leave the database in an inconsistent state. Unfortunately, enforcing these properties comes with quite a bit of overhead, limits concurrent operations by definition, and is not conducive to scaling horizontally.

Scaling horizontally has become a necessity for processing the amounts of data that many of today's Web 2.0 companies need to process. Scaling vertically is more expensive than adding more nodes and is fundamentally limited by the current state of the art in processors, memory, storage capability, and network capacity. This has led companies to increasingly abandon ACID and SQL in favor of more scalable technologies, collectively grouped under the NoSQL flag. These NoSQL technologies are all different, but most emphasize BASE (basically available, soft state, eventually consistent) [38, 41], which is much more conducive to performance but sacrifices much of the precision of ACID.

3 Traditional Databases

3.1 Microsoft SQL Server

Microsoft SQL Server was originally developed in coordination with Sybase, Inc. under the understanding that Microsoft would have "exclusive rights to the DataServer product for OS/2 and all other Microsoft-developed operating systems" [30]. Version 1.0 shipped in 1989 and version 1.1 shipped in 1990. In 1994, after Microsoft shipped Microsoft SQL Server 4.2 for Windows NT, Microsoft and Sybase ended joint development and Microsoft SQL Server became a wholly Microsoft product [30].

There are three standard editions of Microsoft SQL Server 2012. The Standard, Business Intelligence, and Enterprise editions all offer the same basic functionality, but the more advanced editions offer more in the way of database management tools. The Enterprise edition also includes features such as Multi-site and Geo-Clustering [21].

There is also a cloud-based version that Microsoft provides called SQL Azure, which offers traditional SQL database access as a service billed monthly. Microsoft has also implemented a way to scale these databases horizontally using what they call Federations [23]. Utilizing federations adds to the complexity of application development, as non-federated tables cannot have foreign key relationships with a federated table and columns cannot be guaranteed to be unique across federations [35].
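To make the federation idea concrete, the following Python sketch illustrates range-based partitioning of rows across federation members and why a constraint such as column uniqueness is hard to enforce across members. The names used here (Federation, FederationMember, route) are hypothetical and do not correspond to the actual SQL Azure API; this is only an illustration of the partitioning concept.

```python
from bisect import bisect_right

# Hypothetical sketch of federation-style range partitioning (not the SQL Azure API).
# Each member owns a contiguous range of the federation key (here: a customer id).

class FederationMember:
    def __init__(self, low):
        self.low = low    # inclusive lower bound of this member's key range
        self.rows = {}    # stands in for the member's own database

class Federation:
    def __init__(self, split_points):
        # e.g. [0, 1000, 2000] -> three members covering [0,1000), [1000,2000), [2000, ...)
        self.splits = split_points
        self.members = [FederationMember(p) for p in split_points]

    def route(self, key):
        # Pick the member whose range contains the key; the application performs
        # this routing step before every query it issues.
        return self.members[bisect_right(self.splits, key) - 1]

    def insert(self, key, row):
        self.route(key).rows[key] = row

    def get(self, key):
        return self.route(key).rows.get(key)

fed = Federation([0, 1000, 2000])
fed.insert(42, {"name": "Alice"})
fed.insert(1500, {"name": "Bob"})
print(fed.get(42), fed.get(1500))

# Each member only sees its own rows, so a uniqueness constraint on a non-key
# column (say, an email address) cannot be checked locally -- exactly the
# limitation on federated tables noted above.
```

The routing is deliberately a pure function of the federation key; any query that cannot name a key value (or that joins against a non-federated table) must fan out to every member, which is the development overhead the text describes.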
3.2 Oracle 11g

Oracle is an established enterprise DBMS vendor, with product licenses ranging from $47,500 per processor down to a free entry-level version [13]. Oracle's scalability packages revolve around clusters, which are configured manually [11].

Oracle's relational database is geared at more traditional data sets. To handle "Big Data," Oracle has released its own NoSQL Database, which purports to scale horizontally while still supporting ACID transactions [12]. Oracle also has its own toolchain for processing Big Data [9]. While Oracle is an established name with a solid reputation for performant, scalable products, its pricing on scalable solutions is prohibitive for non-enterprise applications [13].

3.3 PostgreSQL

PostgreSQL was originally designed as a successor to the INGRES DBMS. It was to support complex objects and allow for user extensibility of types, operators, and access methods, as well as many other improvements, with minimal changes to the relational model [39]. It is a free and open source (FOSS) database that "is fully ACID compliant, has full support for foreign keys, joins, views, triggers, and stored procedures (in multiple languages)" [1].

3.4 MySQL

MySQL is the traditional database component of the LAMP (Linux, Apache, MySQL, and Perl, Python, or PHP) open source web application stack [31]. MySQL was acquired by Oracle as part of its acquisition of Sun Microsystems in 2010 [10]. Since the acquisition, Oracle has been adding to Sun's commercially licensed side of MySQL, which has threatened to alienate the installed user base [32], many of whom weren't happy about the acquisition in the first place [43].

MySQL supports user specification of the storage engine at a table level [8], which allows users to optimize individual tables. One particular optimization supported by InnoDB, MySQL's default storage engine, is the ability to group commits so that there is only a single write to the log file, increasing write throughput [7].

4 Google's Bigtable

Google developed the specification for a "distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers" [25]. Google uses it in-house for many of its services, such as its web index, Google Earth, and Google Analytics, and also makes it available as a service to AppEngine users via the datastore API [3].

Because Google published the specification [25], there have been open source implementations, most notably Apache HBase. HBase utilizes the Hadoop Core in contrast to the Google File System [4]. Many big web companies have started using HBase, including Facebook, which uses it to "power their Messages infrastructure" [5, 36].

Bigtable doesn't support the traditional relational data model. It provides a simpler model and "treats data as uninterpreted strings" [25]. Bigtable is essentially a sparse map distributed across all nodes, with a composite key made up of a row identifier, column identifier, and timestamp that maps to a string value. Rows are ordered lexicographically by row key, so applications can use similar row keys for data that is likely to be accessed sequentially, in order to take advantage of locality.
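As a rough illustration of this data model, the following Python sketch (a toy model, not the Bigtable or HBase API) represents the table as a map from (row key, column, timestamp) to an uninterpreted string and shows how lexicographic row ordering makes a scan over related rows cheap. The reversed-hostname row keys follow the Webtable example given in the Bigtable paper [25]; everything else here is illustrative.

```python
import time

class SparseTable:
    """Toy model of Bigtable's data model: (row key, column, timestamp) -> string."""

    def __init__(self):
        self.cells = {}  # {(row_key, column, timestamp): value}

    def put(self, row, column, value, ts=None):
        self.cells[(row, column, ts if ts is not None else time.time())] = value

    def scan_row_prefix(self, prefix):
        # Rows sort lexicographically, so every row sharing a prefix is contiguous
        # in key order; a real system turns such a scan into sequential reads on a
        # few tablets instead of touching the whole cluster.
        for (row, column, ts) in sorted(self.cells):
            if row.startswith(prefix):
                yield row, column, ts, self.cells[(row, column, ts)]

table = SparseTable()
# Reversing the hostname keeps pages from the same domain adjacent in row order.
table.put("com.example.www/index.html", "contents:", "<html>...</html>")
table.put("com.example.www/about.html", "anchor:news.example.org", "Example")
table.put("org.wikipedia.en/Main_Page", "contents:", "<html>...</html>")

for row, column, ts, value in table.scan_row_prefix("com.example"):
    print(row, column, value)
```

The key design carries the locality argument: because the only ordering guarantee is on the row key, an application that wants cheap sequential access must encode that access pattern into the row key itself, as the reversed-domain keys above do.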