Amazon Redshift and the Case for Simpler Data Warehouses
Anurag Gupta, Deepak Agarwal, Derek Tan, Jakub Kulesza, Rahul Pathak, Stefano Stefani, Vidhya Srinivasan
Amazon Web Services

Abstract

Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse solution that makes it simple and cost-effective to efficiently analyze large volumes of data using existing business intelligence tools. Since launching in February 2013, it has been Amazon Web Services' (AWS) fastest growing service, with many thousands of customers and many petabytes of data under management.

Amazon Redshift's pace of adoption has been a surprise to many participants in the data warehousing community. While Amazon Redshift was priced disruptively at launch, available for as little as $1000/TB/year, there are many open-source data warehousing technologies and many commercial data warehousing engines that provide free editions for development or under some usage limit. While Amazon Redshift provides a modern MPP, columnar, scale-out architecture, so too do many other data warehousing engines. And, while Amazon Redshift is available in the AWS cloud, one can build data warehouses using EC2 instances and the database engine of one's choice with either local or network-attached storage.

In this paper, we discuss an oft-overlooked differentiating characteristic of Amazon Redshift – simplicity. Our goal with Amazon Redshift was not to compete with other data warehousing engines, but to compete with non-consumption. We believe the vast majority of data is collected but not analyzed. We believe that, while most database vendors target larger enterprises, there is little correlation in today's economy between data set size and company size. And we believe the models used to procure and consume analytics technology need to support experimentation and evaluation. Amazon Redshift was designed to bring data warehousing to a mass market by making it easy to buy, easy to tune and easy to manage while also being fast and cost-effective.

1. Introduction

Many companies augment their transaction-processing database systems with data warehouses for reporting and analysis. Analysts estimate the data warehouse market segment at 1/3 of the overall relational database market segment ($14B vs. $45B for software licenses and support), with an 8-11% compound annual growth rate (CAGR)¹. While this is a strong growth rate for a large, mature market, over the past ten years analysts also estimate data storage at a typical enterprise growing at 30-40% CAGR. Over the past 12-18 months, new market research has begun to show an increase to 50-60%, with data doubling in size every 20 months.

¹ Forecast data combined from IDC, Gartner, and 451 Research.
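As a quick consistency check on these two figures (ours, not the paper's), the lower bound of the 50-60% range already implies a doubling time of roughly 20.5 months, in line with the 20-month figure quoted above:

```python
import math

# Doubling time implied by a compound annual growth rate (CAGR).
# At the lower bound of the 50-60% range cited above, data doubles in
# roughly 20.5 months, matching the "every 20 months" figure.
cagr = 0.50
years_to_double = math.log(2) / math.log(1 + cagr)
print(round(years_to_double * 12, 1))  # ~20.5 months
```

At the 60% end of the range, the implied doubling time drops to roughly 17.7 months.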
[Figure 1: Data Analysis Gap in the Enterprise [10]. Enterprise Data vs. Data in Warehouse, 1990-2020.]

This implies most data in an enterprise is "dark data"²: data that is collected but not easily analyzed. We see this as unfortunate. If our customers didn't see this data as having value, they would not retain it. Many companies are trying to become increasingly data-driven. And yet, not only is most data already dark, the overall data landscape is only getting darker. Storing this data in NoSQL stores and/or Hadoop is one way to bridge the gap for certain use cases. However, it doesn't address all scenarios.

² http://www.gartner.com/it-glossary/dark-data

In our discussions with customers, we found the "analysis gap" between data being collected and data available for analysis to be due to four major causes.

1. Cost – Most commercial database solutions capable of analyzing data at scale require significant up-front expense. This is hard to justify for large datasets with unclear value.

2. Complexity – Database provisioning, maintenance, backup, and tuning are complex tasks requiring specialized skills. They require IT involvement and cannot easily be performed by line-of-business data scientists or analysts.

3. Performance – It is difficult to grow a data warehouse without negatively impacting query performance. Once built, IT teams sometimes discourage augmenting data or adding queries as a way of protecting current reporting SLAs.

4. Rigidity – Most databases work best on highly structured relational data. But a large and increasing percentage of data consists of machine-generated logs that mutate over time, as well as audio and video, none of which is readily accessible to relational analysis.

We see each of the above issues only increasing with data set size. To take one large-scale customer example, the Amazon Retail team collects about 5 billion web log records daily (2TB/day, growing 67% YoY). It is very valuable to combine this data with transactional data stored in their relational and NoSQL systems. Using an existing scale-out commercial data warehouse, they were able to analyze one week of data per hour and maintain a cap of 15 months of log data, using the largest configuration available. Using much larger Hadoop clusters, they were able to analyze up to one month of data per hour, though these clusters were very expensive to administer.

By augmenting their existing systems with a petabyte-scale Amazon Redshift cluster, the Amazon Enterprise Data Warehouse team was able to perform their daily load (5B rows) in 10 minutes, load a month of backfill data (150B rows) in 9.75 hours, take a backup in 30 minutes, and restore it to a new cluster in 48 hours. Most interestingly, they were now able to run queries that joined 2 trillion rows of click traffic with 6 billion rows of product ids in less than 14 minutes, an operation that didn't complete in over a week on their existing systems.

Some of the above results emerge from Amazon Redshift's use of, by now, familiar data warehousing techniques, including columnar layout, per-column compression, co-locating compute and data, co-locating joins, compilation to machine code, and scale-out MPP processing. While these techniques have not fully made their way into the products of the largest data warehousing vendors, Amazon Redshift's approach is very close to that taken by other columnar MPP data warehousing engines.
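To make the first two of these techniques concrete, the sketch below shows a columnar layout with per-column run-length encoding on a toy table. It is our own illustration of the general idea, not Redshift's implementation, and the table and column names are invented for the example.

```python
# Minimal illustration (not Redshift's implementation): columnar layout with
# per-column run-length encoding (RLE), one of several encodings a columnar
# store can choose per column.
from itertools import groupby

# Row-oriented fact table: every query reads every column of every row.
rows = [
    {"event_date": "2015-01-01", "page": "home",   "clicks": 1},
    {"event_date": "2015-01-01", "page": "home",   "clicks": 3},
    {"event_date": "2015-01-01", "page": "search", "clicks": 2},
    {"event_date": "2015-01-02", "page": "home",   "clicks": 1},
]

# Columnar layout: store each column contiguously so a scan touches only the
# columns a query references (e.g., SUM(clicks) GROUP BY event_date).
columns = {name: [r[name] for r in rows] for name in rows[0]}

def rle_encode(values):
    """Run-length encode a column; effective on sorted, low-cardinality data."""
    return [(v, len(list(run))) for v, run in groupby(values)]

def rle_decode(encoded):
    return [v for v, count in encoded for _ in range(count)]

encoded_dates = rle_encode(columns["event_date"])
assert rle_decode(encoded_dates) == columns["event_date"]

# Four stored values shrink to two (value, run_length) pairs.
print(encoded_dates)  # [('2015-01-01', 3), ('2015-01-02', 1)]
```

On billions of sorted, low-cardinality values, per-column encodings like this are what let an analytic scan read far fewer disk blocks than a row store would.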
Beyond these techniques used to scale out query processing and to reduce CPU, disk I/Os, and network traffic, Amazon Redshift had a number of additional design goals:

1. Minimize cost of experimentation – It is difficult to understand whether data is worth analyzing without performing at least a few experiments. Amazon Redshift provides a free trial allowing customers to evaluate the service for 60 days using up to 160GB of compressed SSD data at no charge. For subsequent experiments, they can spin up a cluster with no commitments for $0.25/hour/node. This is inclusive of hardware, software, maintenance and management.

2. Minimize time to first report – We measure the time it takes our customers to go from deciding to create a cluster to seeing the results of their first query. As seen in Figure 2, this can be as little as 15 minutes, even to provision a multi-PB cluster.

3. Minimize administration – Amazon Redshift tries to reduce the undifferentiated heavy lifting of creating and operating a data warehouse. We automate backup, restore, provisioning, patching, failure detection and repair. Advanced operations like encryption, cluster resizing and disaster recovery require only a few clicks to enable.

4. Minimize tuning – […] sampling. We also avoid the use of indexing or projections, instead favoring multi-dimensional z-curves.

[Figure 2: Common admin operation execution time by size. Clusters of 2, 16, and 128 nodes; operations: Deploy, Connect, Backup, Restore, Resize (2 to 16 nodes); x-axis: Duration (Minutes).]

Amazon Redshift also obtains a number of advantages by operating as a managed service in the AWS cloud.

First and foremost, we are able to leverage the pricing and scale of Amazon EC2 for our instances, the durability and throughput of Amazon S3 for backup, and the security of Amazon VPC, to name just a few. This enables our customers to benefit from the innovation and price reductions delivered across a much larger team than ours alone.

Second, by making deployments and patching automatic and painless for our customers, we are able to deploy software at a high frequency. We have averaged the addition of one feature per week over the past two years. Reducing the cost of deployment, and the rapid feedback mechanisms available in running a service, allows us to run a more agile OODA³ loop compared to on-premise providers, ensuring we quickly provide the features our customers care most about.

Finally, since we manage a fleet of thousands of Amazon Redshift clusters, we benefit from automating capabilities that would not make economic sense for any individual on-premise DBA. An operation requiring one minute of per-node overhead is worth many days of investment for us to remove. Removing the root cause of a defect that causes an individual cluster to fail and restart once per thousand days is well worth our investment to reduce paging. As we scale, we have to continually improve the reliability, availability and automation across our fleet to manage operations, which makes failures increasingly uncommon for our customers.

In this paper, we discuss Amazon Redshift's system architecture, […]
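To give a feel for the fleet-economics argument above, here is a back-of-envelope sketch. The fleet size and maintenance cadence are our own assumptions for illustration; the paper states only that the fleet comprises thousands of clusters.

```python
# Back-of-envelope: cumulative cost of a one-minute per-node manual step.
# NODES_IN_FLEET and CYCLES_PER_YEAR are illustrative assumptions, not figures
# from the paper.
NODES_IN_FLEET = 10_000   # assumed: thousands of clusters, a few nodes each
CYCLES_PER_YEAR = 52      # assumed: one maintenance touch per node per week

overhead_minutes = NODES_IN_FLEET * CYCLES_PER_YEAR * 1  # one minute per node
overhead_days = overhead_minutes / 60 / 24

# Roughly 361 days of cumulative overhead per year, so spending several
# engineer-days to automate the step away pays for itself almost immediately.
print(round(overhead_days, 1))  # 361.1
```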