Introduction

Michael L. Brodie, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology

Our story begins at the University of California, Berkeley in 1971, in the heat of the Vietnam War. A tall, ambitious, newly minted assistant professor with an EE Ph.D. in Markov chains asks a seasoned colleague for direction in shaping his nascent career for more impact than he could imagine possible with Markov chains.[1] Professor Eugene Wong suggests that Michael Stonebraker read Ted Codd's just-published paper on a striking new idea: relational databases. Upon reading the paper, Mike is immediately convinced of its potential, even though he knows essentially nothing about databases. He sets his sights on that pristine relational mountaintop. The rest is history. Or rather, it is the topic of this book. Ted's simple, elegant relational data model and Mike's contributions to making databases work in practice helped forge what is today a $55 billion industry. But wait, I'm getting ahead of myself.

A Brief History of Databases

What is a database? What is data management? How did they evolve?

Imagine accelerating by orders of magnitude the discovery of cancer causes and of the most effective treatments, to radically reduce the 10M annual cancer deaths worldwide. Or enabling autonomous vehicles to profoundly reduce the 1M annual traffic deaths worldwide, while reducing pollution, traffic congestion, and the real estate wasted on vehicles that on average are parked 97% of the time. These are merely two examples of the potential future positive impacts of using Big Data. As with all technical advances, there is also potential for negative impacts, both unintentional (in error) and intentional, such as undermining modern democracies (allegedly well under way). Using data means managing data efficiently at scale; that is what data management is for.

In its May 6, 2017 issue, The Economist declared data to be the world's most valuable resource. In 2012, data science, often AI-driven analysis of data at scale, exploded onto the world stage. This new, data-driven discovery paradigm may be one of the most significant advances of the early 21st century. Non-practitioners are always surprised to find that 80% of the resources required for a data science project are devoted to data management. The surprise stems from the fact that data management is an infrastructure technology: basically, unseen plumbing. In business, data management "just gets done" by mature, robust database management systems (DBMSs). But data science poses new, significant data management challenges that have yet to be understood, let alone addressed.

These potential benefits, risks, and challenges herald a new era in the development of data management technology, the topic of this book. How did it develop in the first place? What were the previous eras? What challenges lie ahead? This introduction briefly sketches answers to those questions.

This book is about the development and ascendancy of data management technology, enabled by the contributions of Michael Stonebraker and his collaborators. Novel when created in the 1960s, data management became the key enabling technology for businesses of all sizes worldwide, leading to today's $55B[2][3] DBMS market and tens of millions of operational databases. The average Fortune 100 company has more than 5,000 operational databases, supported by tens of DBMS products.
Databases support your daily activities, such as securing your banking and credit card transactions. So that you can buy that Starbucks latte, Visa, the leading global payments processor, must be able to process 50,000 credit card transactions per second simultaneously. These "database" transactions update not just your account and those of 49,999 other Visa cardholders, but also those of 50,000 creditors like Starbucks, while simultaneously screening you, Starbucks, and 99,998 others for fraud, no matter where on the planet your card was swiped. Another slice of your financial world may involve one of the more than 3.8B trade transactions that occur daily on major U.S. market exchanges. DBMSs support such critical functions not just for financial systems, but also for systems that manage inventory, air traffic control, and supply chains, and for all daily functions that depend on data and data transactions. Databases managed by DBMSs are even on your wrist and in your pocket if you, like 2.3B others on the planet, use an iPhone or Android smartphone. If databases stopped, much of our world would stop with them.

A database is a logical collection of data, like your credit card account information, often stored in records that are organized under some data model, such as tables in a relational database. A DBMS is a software system that manages databases and ensures persistence (so data is not lost), with languages to insert, update, and query data, often at very high data volumes and low latencies, as illustrated above.

Data management technologies have evolved through four eras over six decades, as illustrated in the Data Management Technology Kairometer. In the inaugural navigational era (1960s), the first DBMSs emerged. In them, data, such as your mortgage information, was structured in hierarchies or networks and accessed using record-at-a-time navigational query languages. The navigational era gained a database Turing Award in 1973 for Charlie Bachman's "outstanding contributions to database technology."

In the second, relational era (1970s-1990s), data was stored in tables accessed using a declarative, set-at-a-time query language, SQL: for example, "Select name, grade From students in Engineering with a B average." The relational era ended with approximately 30 commercial DBMSs, dominated by Oracle's Oracle, IBM's DB2, and Microsoft's SQL Server. It gained two database Turing Awards: one in 1981 for Ted Codd's "fundamental and continuing contributions to the theory and practice of database management systems, esp. relational databases" and one in 1998 for Jim Gray's "seminal contributions to database and transaction processing research and technical leadership in system implementation."

The start of Mike Stonebraker's database career coincided with the launch of the relational era. Serendipitously, Mike was directed by his colleague, UC Berkeley professor Eugene Wong, to the topic of data management and to the relational model via Ted's paper. Attracted by the model's simplicity compared to that of navigational DBMSs, Mike set his sights, and ultimately built his career, on making relational databases work. His initial contributions, Ingres and Postgres, did more to make relational databases work in practice than those of any other individual.
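To make the relational era's declarative, set-at-a-time style concrete, here is the English query above written as actual SQL, together with the insert and update operations a DBMS language provides. This is a minimal sketch: the table definition, column names (dept, gpa), and the 4-point grading scale are illustrative assumptions, not details from this book.

    -- Hypothetical table; names and types are illustrative assumptions.
    CREATE TABLE students (
      name  VARCHAR(64),
      dept  VARCHAR(32),
      grade CHAR(2),
      gpa   DECIMAL(3,2)
    );

    -- Insert and update: the other core operations a DBMS language provides.
    INSERT INTO students (name, dept, grade, gpa)
    VALUES ('Pat Lee', 'Engineering', 'B', 3.10);

    UPDATE students SET gpa = 3.30 WHERE name = 'Pat Lee';

    -- "Select name, grade From students in Engineering with a B average":
    SELECT name, grade
    FROM students
    WHERE dept = 'Engineering'
      AND gpa >= 3.0;  -- "a B average," assuming a 4-point scale

The point of the example is the one the text makes: the query names the set of rows wanted, not a record-at-a-time navigation path to reach them, leaving the access strategy to the DBMS.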
After more than 30 years, Postgres, via PostgreSQL and other derivatives, continues to have a significant impact. PostgreSQL is the third[4] or fourth[5] most popular (i.e., most used) of hundreds of DBMSs, and all relational DBMSs implement the object-relational data model and features introduced in Postgres.

In the first relational decade, researchers developed core RDBMS capabilities including query optimization, transactions, and distribution. In the second relational decade, researchers focused on high-performance queries for data warehouses (using column stores) and high-performance transactions and real-time analysis (using in-memory databases); extended RDBMSs to handle additional data types and processing (using abstract data types); and tested the "one-size-fits-all" principle by fitting myriad application types into RDBMSs. But complex data structures and operations, such as those in geographic information, graphs, and scientific processing over sparse matrices, just didn't fit. Mike knew because he pragmatically exhausted that premise, using real, complex database applications to push RDBMS limits, eventually concluding that "one size does not fit all" and moving to special-purpose databases.

With the rest of the database research community, Mike turned his attention to special-purpose databases, launching the third, non-relational era (2000-2010). Researchers in the non-relational era developed data management solutions for non-relational data and its associated processing (e.g., time series, semi- and unstructured data, key-value data, graphs, documents), in which data was stored in special-purpose forms and accessed using non-SQL query languages, called NoSQL, of which Hadoop is the best-known example. Mike pursued non-relational challenges in the data manager but, citing the inefficiency of NoSQL, continued to leverage SQL's declarative power by accessing data with SQL-like languages, including an extension called NewSQL. New DBMSs proliferated, due to the research push for, and the application pull of, specialized DBMSs, and to the growth of open-source software. The non-relational era ended with more than 350 DBMSs, split evenly between commercial and open source, supporting specialized data and related processing including (in order of utilization): relational, key-value, document, graph, time series, RDF, object (object-oriented), search, wide-column, multi-value/dimensional, native XML, content (e.g., digital, text, image), event, and navigational. Despite the choice and diversity

Notes
[1] "I had to publish, and my thesis topic was going nowhere." —Mike Stonebraker
[2] https://www.statista.com/statistics/724611/worldwide-database-market/
[3] Numbers quoted in this chapter are as of mid-2018.
[4] https://www.eversql.com/most-popular-databases-in-2018-according-to-stackoverflow-survey/
[5] https://db-engines.com/en/ranking