What to Consider When Choosing Between Amazon Redshift and Google BigQuery

Table of Contents

Introduction
    Data is Integral
    Data Warehouses: A Brief Overview
A Baseline for Data Warehouses
    Context for our Research
Performance
    Throughput
    Concurrency
Operations
    Provisioning
    Loading
    Maintenance
    Security
Cost
Use Cases
    How Everlane Uses Amazon Redshift for Daily Projected Revenue
    How Reddit Uses Google BigQuery for Personalized Sales Pitches
Conclusion
About Chartio
References

Introduction

Data runs our world. It's how startups track usage to determine product viability and how large enterprises determine quarterly performance. Making data-driven decisions is no longer a nice-to-have for companies; it's a competitive requirement.

Data is Integral

Most companies run multiple operational databases like MySQL or PostgreSQL, track web analytics with Google Analytics and use Customer Relationship Management (CRM) systems like Salesforce, each of which collects volumes of data. Historically, this data has been stored in separate silos, often as a mix of structured and unstructured data that must be transformed and combined before meaningful data analysis can be performed.

As a company continues to amass even larger volumes of data, it may be time to evaluate data warehousing as a potential solution to one or more of the following challenges:

1. Data isolation/consolidation: Data needs to be collocated in order for analysis to cross application boundaries (e.g., Google Analytics visitor traffic and sales receipts).
2. Database suitability: Production systems tuned for concurrent latency-sensitive transactions may not be particularly efficient when queries touch the entire dataset.
3. Long-term retention: Storing many years of historical data may be desirable, but it may not be practical to use the same systems as used for day-to-day operations.
Data warehouses are not just for storing data; they must be architected to facilitate analytics and business reporting needs, handling complex analytical queries quickly without impacting operational databases or other systems where the data was originally created. Raw data on its own offers no insights and addresses few business goals, but by loading raw data into a data warehouse, it's possible to facilitate data exploration, interactive reporting and data-informed decision-making for an entire organization.

Data Warehouses: A Brief Overview

Historically, data warehouses were clunky systems that took up physical space, needed a white-glove installation and required a team of database administrators to keep systems running smoothly. Today's data warehouses are cloud-based, available on demand and require significantly lower upfront and ongoing expenses than legacy data warehouses.

A wide range of businesses, from SaaS startups to Fortune 500 companies, are storing and analyzing massive amounts of data without any server, storage or networking of their own. By outsourcing management of those systems to cloud providers, they can focus employee efforts on analyzing data rather than keeping data centers running 24/7.

Amazon Web Services (AWS) was an early pioneer of cloud-based data warehousing with Amazon Redshift. In the years since its launch, many other players have entered this space, and Amazon Redshift now has its share of competitors, notably Google Cloud Platform's offering of Google BigQuery.

In this whitepaper, we'll examine the differences between these two data warehouse power players and outline what to consider when making your purchase decision. In particular, we'll focus on performance, operations and cost.

A Baseline for Data Warehouses

There's no one-size-fits-all data warehouse, but it's crucial to choose a data warehouse that fits your business needs and will scale alongside your company.
Whichever data warehouse you choose, all of them share similarities, which can make evaluation difficult.

Context for our Research

We're comparing Amazon Redshift and Google BigQuery because both systems:

1. Are marketed as fully-managed petabyte-scalable systems
2. Leverage parallel processing
3. Leverage columnar storage
4. Are geared towards interactive reporting on large data sets
5. Support integrations and connections with various applications, including Business Intelligence tools

Amazon Redshift

Since launching in February 2013, Amazon Redshift has been one of the fastest-growing Amazon Web Services (AWS) offerings. Notably, Amazon Redshift is based on PostgreSQL 8.0.2 and technology created by ParAccel, a database management system designed for advanced analytics for Business Intelligence. Since its launch, Amazon Redshift has added more than 130 significant features, making it a cloud-native data warehouse that differs from ParAccel.

Google BigQuery

Google BigQuery first launched in public beta at the 2012 Google I/O conference as an offering within the Google Cloud Platform. Google BigQuery was initially created by developers as an internal tool called Dremel. Since launch, Google has made significant changes to the internal tool and created a data warehouse that is used by large enterprises and smaller companies alike. Google BigQuery takes full advantage of Google's infrastructure and processing power and is an append-only system.

Both Amazon Redshift and Google BigQuery employ similar columnar architectural concepts under the hood, but leverage different distributed systems to reach the same goal of supporting interactive reporting on large data sets. So let's dive into the comparison and see where Amazon Redshift and Google BigQuery differ.

Performance

Interactive reporting is crucial for businesses.
By giving everyone access to data, it lets them engage with information in new ways and ultimately helps improve job performance. As a data analyst, empowering business users with interactive reporting frees up your time to perform more sophisticated analyses. However, as more users query data, the added load can put performance pressure on a data warehouse.

Many of the performance characteristics that Amazon Redshift and Google BigQuery share around scalability and speed are industry standards, so we're focusing our analysis on throughput and concurrency.

Throughput

Throughput is the speed at which a data warehouse can perform queries. For those evaluating a data warehouse, throughput is a key area of consideration because customers want to ensure that they're investing time, resources and money into a system that will adapt and perform to their needs. Yet Amazon Redshift and Google BigQuery handle throughput in divergent ways.

Amazon Redshift

Amazon Redshift uses a distributed columnar architecture to minimize and parallelize the I/O hurdles that many traditional data warehouses come up against. Within the Amazon Redshift system, each column of a table is stored in data blocks with the goal of reducing I/O so only relevant data is retrieved from disks.
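As a rough illustration of why storing each column in its own data blocks reduces I/O, the sketch below pivots a small row-oriented table into per-column blocks and then sums a single column while counting how many blocks it touches. It is a toy model under stated assumptions, not Amazon Redshift's actual storage format: the table, the column names, the two-value block size, and the helper functions `build_column_store` and `sum_column` are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical toy data; the table, column names, and values are illustrative only.
ROWS = [
    {"order_id": 1, "region": "east", "amount": 20.0},
    {"order_id": 2, "region": "west", "amount": 35.5},
    {"order_id": 3, "region": "east", "amount": 12.0},
    {"order_id": 4, "region": "west", "amount": 8.5},
    {"order_id": 5, "region": "east", "amount": 40.0},
    {"order_id": 6, "region": "west", "amount": 4.0},
]

# Maps a column name to that column's list of fixed-size blocks.
ColumnStore = Dict[str, List[list]]


def build_column_store(rows: List[dict], block_size: int) -> ColumnStore:
    """Pivot row-oriented records into per-column blocks of at most block_size values."""
    store: ColumnStore = {}
    for row in rows:
        for name, value in row.items():
            blocks = store.setdefault(name, [[]])
            if len(blocks[-1]) == block_size:
                blocks.append([])  # current block is full, start a new one
            blocks[-1].append(value)
    return store


def sum_column(store: ColumnStore, column: str) -> Tuple[float, int]:
    """Sum one column, returning the total and the number of blocks read."""
    total = 0.0
    blocks_read = 0
    for block in store[column]:  # blocks of other columns are never touched
        blocks_read += 1
        total += sum(block)
    return total, blocks_read


if __name__ == "__main__":
    store = build_column_store(ROWS, block_size=2)
    total, blocks_read = sum_column(store, "amount")
    all_blocks = sum(len(blocks) for blocks in store.values())
    print(f"sum(amount) = {total}; read {blocks_read} of {all_blocks} blocks")
```

In this toy layout, a query over `amount` reads three of the nine blocks; a row-oriented layout would have to scan every record to answer the same question.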
Amazon Redshift is able to reduce I/O through:

• Column storage: data blocks are fetched only from the specific columns that a query requires
• Compression: customers typically get 3-5X compression, which means less I/O and more effective storage
• Zone maps: each column is divided into 1MB data blocks; Redshift stores the min/max values of each block in memory and is able to identify the blocks that are required for a query
• Direct-attached storage and large block sizes (1MB) that enable fast I/O

The release of Dense Storage (DS2) in June 2015 allows for twice the memory and compute power of its predecessor (DS1) and the same storage capacity at the same cost, which leads the way for overall improvements to Amazon