
Stonebraker: from theory to successful practice in database systems (an eyewitness to this trajectory – Marta Mattoso)

Workshop on Scientific Data Challenges, August 2014, COPPE/UFRJ (Ilha do Fundão). Marta Mattoso, Database Research Line (Linha de BD), PESC, COPPE


Michael Stonebraker received the 2014 ACM Turing Award in June. "For fundamental contributions to the concepts and practices underlying modern database systems."

Stonebraker

Michael Stonebraker, 2014 ACM Turing Award (US$1 million prize, funded by Google).

Other Turing Award winners in databases: Ted Codd (relational model, 1970) and Jim Gray (1998).


"Brought Relational Database Systems from Concept to Commercial Success; Set the Research Agenda for the Multibillion-Dollar Database Field for Decades"
• Database systems are critical applications of computing and preserve much of the world's important data.
• Stonebraker invented many of the concepts that are used in almost all modern database systems. He demonstrated how to engineer database systems that support these concepts and released these systems as open software, which ensured their widespread adoption.
• Source code from Stonebraker's systems can be found in many modern database systems.
• During a career spanning four decades, Stonebraker founded numerous companies successfully commercializing his pioneering database technology work.

ACM Press Release

AT THE CENTER OF TECHNOLOGICAL CHANGE AND CONTROVERSY: Stonebraker's battles


INGRES
• Stonebraker developed Ingres, proving the viability of the theory.
• Ingres was one of the first two relational database systems (the other was IBM's System R).
• Major contributions included query language design, query processing techniques, access methods, and concurrency control. Stonebraker also showed that query rewrite techniques (query modification) could be used to implement relational views and access control (see the SQL sketch below).
Stonebraker timeline, ACM Press Release
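To make the query-modification idea concrete, here is a minimal sketch; the table, view, and predicates are hypothetical, and the syntax is modern SQL rather than Ingres's QUEL. A query written against a restricted view is rewritten by the system into a query on the base table, with the view's predicate merged in, which is how views and access control can be enforced without materializing anything.

    -- Hypothetical base table and a view that exposes only one department.
    CREATE TABLE employee (
        name   VARCHAR(40),
        dept   VARCHAR(20),
        salary NUMERIC
    );

    CREATE VIEW toy_emp AS
        SELECT name, salary
        FROM employee
        WHERE dept = 'toy';

    -- A user query written against the view:
    SELECT name FROM toy_emp WHERE salary > 50000;

    -- Query modification rewrites it internally into a query on the base
    -- table, conjoining the view's predicate with the user's predicate:
    SELECT name FROM employee WHERE dept = 'toy' AND salary > 50000;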

The 1970s (WW I)
• Betting on the relational model: Ingres
• IBM counterattacks with DB2 (System R)
• Losing the battle to Oracle
Against the hegemony of the vendors


POSTGRES
• Open-source DBMS.
• Object-relational model of database architecture, integrating important ideas from object-oriented programming into the relational database context (sketched below).
• Concepts introduced in Ingres and Postgres can be found in nearly all major database systems today.
• Ingres and Postgres were well engineered, built on UNIX, released as open software, and form the basis of many modern commercial database systems including Illustra, Informix, Netezza, and Greenplum.
Stonebraker timeline, ACM Press Release
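As a rough illustration of the object-relational ideas that Postgres brought into the relational world, here is a small sketch using current PostgreSQL syntax (the type and table names are invented for the example): user-defined composite types stored in columns, and table inheritance, both of which trace back to the Postgres line.

    -- A user-defined composite type used as a column type.
    CREATE TYPE complex AS (
        r DOUBLE PRECISION,
        i DOUBLE PRECISION
    );

    CREATE TABLE signal (
        id     SERIAL PRIMARY KEY,
        sample complex
    );

    -- Table inheritance: capitals extends cities; a query on the parent
    -- table also returns the rows stored in the child table.
    CREATE TABLE cities (
        name       TEXT,
        population INTEGER
    );

    CREATE TABLE capitals (
        state TEXT
    ) INHERITS (cities);

    SELECT name, population FROM cities;   -- sees cities and capitals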

The 1980s
• Object orientation and relational DBMSs
• The war between the OR (object-relational) and OO (object-oriented) camps
• Development of the object-relational Postgres and Illustra
Fighting for OR against the hegemony of the pure-relational vendors


The 1990s (WW III)
• The open-source movement
• Postgres and its variants
• Extensible, adaptive query optimization
Against the hegemony of the RDBMS market leaders

D-INGRES
• Lasting technical results on distributed query processing and transaction coordination protocols through the development of Distributed Ingres, one of the first distributed database systems.
XPRS
• A parallel version of Postgres that explored the "shared nothing" approach to parallel database management.
Mariposa
• A massively distributed federated database system that explored ideas such as opportunistic data replication and decentralized query processing.
Stonebraker timeline, ACM Press Release


Aurora/StreamBase
• Real-time processing over streaming data sources.
C-Store/Vertica
• Column-oriented storage architecture resulted in systems optimized for complex queries.
H-Store/VoltDB
• High-throughput, distributed, main-memory online transaction processing system.
SciDB
• Extreme-scale data management and data analysis system for science.
Developed database architectures for specialized purposes.

The 2000s (WW III)
• "No size fits all"
• Fighting the "elephants"
• OLAP (analytical queries) vs. OLTP (e.g., ticket-sale transactions); see the SQL sketch below
Against the hegemony of the RDBMS market leaders
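A minimal SQL sketch of the OLAP vs. OLTP contrast behind "no size fits all"; the ticket-sale schema is invented for the example. The OLTP side is a short, ACID transaction touching a few rows, the kind of workload a main-memory row store such as H-Store/VoltDB targets. The OLAP side is an analytic scan over a couple of columns of a large table, the kind of workload a column store such as C-Store/Vertica targets.

    -- OLTP: a ticket sale touches a handful of rows and must be atomic.
    BEGIN;
    UPDATE seats
       SET status = 'sold', buyer_id = 42
     WHERE event_id = 123 AND seat_no = 'A17' AND status = 'free';
    INSERT INTO orders (buyer_id, event_id, seat_no, price)
    VALUES (42, 123, 'A17', 80.00);
    COMMIT;

    -- OLAP: an analytic query scans only two columns of millions of rows,
    -- which a column-oriented layout can read (and compress) selectively.
    SELECT event_id, SUM(price) AS revenue
    FROM orders
    GROUP BY event_id
    ORDER BY revenue DESC;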


• Relational database management systems (DBMSs) have been remarkably successful in capturing the DBMS marketplace. To a first approximation they are "the only game in town," and the major vendors (IBM, Oracle, and Microsoft) enjoy an overwhelming market share.
• They are selling "one size fits all"; i.e., a single relational engine appropriate for all DBMS needs.
• Moreover, the code line from all of the major vendors is quite elderly, in all cases dating from the 1980s. Hence, the major vendors sell software that is a quarter century old and has been extended and morphed to meet today's needs.
Stonebraker: these legacy systems are at the end of their useful life. They deserve to be sent to the "home for tired software."
http://cacm.acm.org/blogs/blog-cacm/32212-the-end-of-a-dbms-era-might-be-upon-us

The 2010s
• NoSQL: abandoning ACID is not a good idea
• Parallel RDBMSs are more efficient than Hadoop/Hive and the like for databases

Criticizing NoSQL systems and the Hadoop ecosystem


• NoSQL. There have been a variety of startups in the past few years that call themselves NoSQL vendors. Most claim extreme scalability and high performance, achieved through relaxing or eliminating transaction support and moving back to a low-level DBMS interface, thereby eliminating SQL.
• In my opinion, these vendors have a couple of issues when presented with New OLTP. First, most New OLTP applications want real ACID. Replacing real ACID with either no ACID or "ACID lite" just pushes consistency problems into the applications, where they are far harder to solve. Second, the absence of SQL makes queries a lot of work. In summary, NoSQL will translate into "lots of work for the application"; i.e., this will be the full employment act for programmers for the indefinite future. (Both points are sketched below.)
New SQL: An Alternative to NoSQL and Old SQL for New OLTP Apps
http://cacm.acm.org/blogs/blog-cacm/109710-new-sql-an-alternative-to-nosql-and-old--for-new-oltp-apps
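Two tiny SQL sketches of the points in the quote above; the accounts and orders tables are hypothetical. With real ACID, related writes commit or roll back together inside the DBMS, and a declarative query replaces the scan-and-aggregate code an application would otherwise have to write against a low-level interface.

    -- Real ACID: both updates succeed or neither does; the engine handles
    -- crashes and concurrent access instead of the application.
    BEGIN;
    UPDATE accounts SET balance = balance - 100 WHERE id = 1;
    UPDATE accounts SET balance = balance + 100 WHERE id = 2;
    COMMIT;

    -- SQL as a query language: one statement instead of a hand-written
    -- aggregation program in the application.
    SELECT customer_id, COUNT(*) AS num_orders, SUM(total) AS spent
    FROM orders
    GROUP BY customer_id
    HAVING SUM(total) > 1000;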

We see the following trajectory in Hadoop data management:
• Step 1: Adopt Hadoop for pilot projects.
• Step 2: Scale Hadoop to production use.
• Step 3: Observe an unacceptable performance penalty.
• Step 4: Morph to a real parallel DBMS.
MapReduce is an internal interface in a parallel DBMS, and one that is not well suited to the needs of a DBMS (sketched below).
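The aggregation task from the SIGMOD 2009 comparison paper presented below makes this concrete: in a parallel DBMS the whole MapReduce-style computation is one declarative statement, and the map/shuffle/reduce steps happen inside the engine. The query is the one used in that benchmark; the execution notes are a simplified sketch of how a shared-nothing engine typically runs it.

    -- Benchmark aggregation task: total ad revenue per source IP.
    SELECT sourceIP, SUM(adRevenue)
    FROM UserVisits
    GROUP BY sourceIP;

    -- Rough execution plan in a shared-nothing parallel DBMS:
    --   "map":    each node scans its local partition of UserVisits and
    --             computes partial sums per sourceIP;
    --   "reduce": partial results are repartitioned by sourceIP and merged
    --             into the final sums.
    -- In Hadoop the same job requires explicit Map and Reduce programs.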

A Comparison of Approaches to Large-Scale Data Analysis (SIGMOD 2009)

Andrew Pavlo (Brown University), Erik Paulson (University of Wisconsin), Alexander Rasin (Brown University), Daniel J. Abadi (Yale University), David J. DeWitt (Microsoft Inc.), Samuel Madden (MIT CSAIL), Michael Stonebraker (MIT CSAIL)

Abstract (from the paper, whose first page is reproduced on the slide): "There is currently considerable enthusiasm around the MapReduce (MR) paradigm for large-scale data analysis. Although the basic control flow of this framework has existed in parallel SQL database management systems (DBMS) for over 20 years, some have called MR a dramatically new computing model. In this paper, we describe and compare both paradigms. Furthermore, we evaluate both kinds of systems in terms of performance and development complexity. To this end, we define a benchmark consisting of a collection of tasks that we have run on an open source version of MR as well as on two parallel DBMSs. For each task, we measure each system's performance for various degrees of parallelism on a cluster of 100 nodes. Our results reveal some interesting trade-offs. Although the process to load data into and tune the execution of parallel DBMSs took much longer than the MR system, the observed performance of these DBMSs was strikingly better. We speculate about the causes of the dramatic performance difference and consider implementation concepts that future systems should take from both kinds of architectures."

• Impala is architected exactly like the shared-nothing parallel SQL DBMSs serving the data warehouse market. Specifically, notice that the MapReduce layer has been removed, and for good reason: MapReduce is not a useful internal interface inside a SQL (or Hive) DBMS.
• Google announced MapReduce is yesterday's news and they have moved on, building their software offerings on top of better systems such as Dremel, BigTable, and F1/Spanner. In fact, Google must be "laughing in their beer" about now. They invented MapReduce to support the Web crawl for their search engine in 2004. A few years ago they replaced MapReduce in this application with BigTable, because they wanted an interactive storage system and MapReduce was batch-only. Hence, the driving application behind MapReduce moved to a better platform a while ago. Now Google is reporting they see little-to-no future need for MapReduce.
• From the point of view of a parallel SQL/Hive DBMS, HDFS is a "fate worse than death."
• I wonder how long it will take the rest of the world to follow Google's direction and do likewise...
http://cacm.acm.org/magazines/2015/1/181612-a-valuable-lesson-and-whither-hadoop

What do you think are some of the key trends shaping the world of data management in 2012?
• "The first one is one size no longer fits all, and that is going to give heartburn to the current [relational database] elephants.
• The second one is 'big data' means three different things and you've got to remember which one you're talking about.
• The third one is that ACID [atomicity, consistency, isolation, durability] is a really good idea, so don't throw the baby out with the bathwater.
• The fourth thing is: think memory. It's the new disk.
• The last thing is that, in stages, the cloud is really the answer to save money."

• For undergraduates just getting into computer science, "Learn how to code, and code well, because whatever you do is going to involve implementation.”

• For Ph.D. students trying to figure out where to focus their attention, he suggests talking to real-world computer users. "They're happy to tell you why they don't like or do like any given technology, so they're a wonderful source of problems to work on."

Stonebraker's advice
