What to Consider When Choosing Between Amazon Redshift and BigQuery

Table of Contents

Introduction
Data is Integral
Data Warehouses: A Brief Overview
A Baseline for Data Warehouses
Context for our Research
Performance
Throughput
Concurrency
Operations
Provisioning
Loading
Maintenance
Security
Cost
Use Cases
How Everlane Uses Amazon Redshift for Daily Projected Revenue
How Reddit Uses Google BigQuery for Personalized Sales Pitches
Conclusion
About Chartio
References

Introduction

Data runs our world. It’s how startups track usage to determine product viability and how large enterprises determine quarterly performance. Making data-driven decisions is no longer a nice-to-have for companies; it’s a competitive requirement.

Data is Integral

Most companies run multiple operational databases like MySQL or PostgreSQL, track web analytics with tools like Google Analytics and use Customer Relationship Management (CRM) systems like Salesforce—each of which collects volumes of data. Historically, this data has been stored in separate silos, often as a mix of structured and unstructured data that must be transformed and combined before meaningful data analysis can be performed.

As a company continues to amass even larger volumes of data, it may be time to evaluate data warehousing as a potential solution to one or more of the following challenges:

1. Data isolation/consolidation: Data needs to be collocated in order for analysis to cross application boundaries (e.g. Google Analytics visitor traffic and sales receipts).

2. Database suitability: Production systems tuned for concurrent latency-sensitive transactions may not be particularly efficient when queries touch the entire dataset.

3. Long-term retention: Storing many years of historical data may be desirable, but it may not be practical to use the same systems as used for day-to-day operations.

Data warehouses are not just for storing data; they must be architected to facilitate analytics and business reporting needs, handling complex analytical queries quickly without impacting operational databases or other systems where the data was originally created.

Raw data on its own offers no insights and addresses few business goals, but by loading raw data into a data warehouse, it’s possible to facilitate data exploration, interactive reporting and data-informed decision-making for an entire organization.

Data Warehouses: A Brief Overview

Historically, data warehouses were clunky systems that took up physical space, needed a white-glove installation and required a team of database administrators to keep systems running smoothly. Today’s data warehouses are cloud-based, available on-demand and require significantly lower upfront and ongoing expenses than legacy data warehouses.

A wide range of businesses from SaaS startups to Fortune 500 companies are storing and analyzing massive amounts of data without any server, storage or networking of their own. By outsourcing management of those systems to cloud providers, they can focus employee efforts on analyzing data rather than keeping data centers running 24/7.

Amazon Web Services (AWS) was an early pioneer of cloud-based data warehousing with Amazon Redshift. In the years since its launch, many other players have entered this space, and Amazon Redshift now has its share of competitors, notably Google’s offering of Google BigQuery.

In this whitepaper, we’ll examine the differences between these two data warehouse power players and explore what to consider when making your purchase decision. In particular, we’ll focus on performance, operations and cost.

A Baseline for Data Warehouses

There’s no one-size-fits-all data warehouse, but it’s crucial to choose one that fits your business needs and will scale alongside your company. No matter which data warehouse you choose, they all share similarities, which can make evaluation difficult.

Context for our Research

We’re comparing Amazon Redshift and Google BigQuery because both systems:

1. Are marketed as fully-managed petabyte-scalable systems

2. Leverage parallel processing

3. Leverage columnar storage

4. Are geared towards interactive reporting on large data sets

5. Support integrations and connections with various applications, including Business Intelligence tools

Amazon Redshift

Since launching in February 2013, Amazon Redshift has been one of the fastest-growing Amazon Web Services (AWS) offerings. Notably, Amazon Redshift is based on PostgreSQL 8.0.2 and technology created by ParAccel, a database management system designed for advanced analytics for Business Intelligence. Since its launch, Amazon Redshift has added more than 130 significant features, making it a cloud-native data warehouse distinct from ParAccel.

Google BigQuery

Google BigQuery first launched in public beta at the 2012 Google I/O conference as an offering within the Google Cloud Platform. Google BigQuery was initially created by Google developers as an internal tool called Dremel. Since launch, Google has built significantly upon the internal tool and created a data warehouse used by large enterprises and smaller companies alike. Google BigQuery takes full advantage of Google’s infrastructure and processing power and is an append-only system.

Both Amazon Redshift and Google BigQuery employ similar columnar architectural concepts under the hood, but leverage different distributed systems to reach the same goal of supporting interactive reporting on large data sets. So let’s dive into the comparison and see where Amazon Redshift and Google BigQuery differ.

Performance

Interactive reporting is crucial for businesses. Giving everyone access to data allows them to engage with information in new ways and, ultimately, helps improve job performance.

As a data analyst, empowering business users with interactive reporting frees up your time to perform more sophisticated analyses. However, additional users querying data can put performance pressure on a data warehouse.

Many performance commonalities between Amazon Redshift and Google BigQuery around scalability and speed are industry standards, so we’re focusing our analysis on throughput and concurrency.

Throughput

Throughput is the speed at which a data warehouse can perform queries. For those evaluating a data warehouse, throughput is a key area of consideration because customers want to ensure that they’re investing time, resources and money into a system that will adapt and perform to their needs. And yet, Amazon Redshift and Google BigQuery handle throughput in divergent ways.

Amazon Redshift

Amazon Redshift uses a distributed columnar architecture to minimize and parallelize the I/O hurdles that many traditional data warehouses come up against. Within the Amazon Redshift system, each column of a table is stored in data blocks, with the goal of reducing I/O so only relevant data is retrieved from disk. Amazon Redshift is able to reduce I/O through the following (illustrated in the sketch after this list):

• Column storage: fetches data blocks only from the specific columns that are required for queries

• Compression: customers typically get 3-5X compression, which means less I/O and more effective storage

• Zone maps: each column is divided into 1MB data blocks; Redshift stores the min/max values of each block in memory and can identify which blocks a query requires

• Direct-attached storage: large block sizes (1MB) enable fast I/O
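To make this concrete, here is a hypothetical query against an illustrative sales table (the table and column names are ours, not from the source). Only the two referenced columns are fetched from disk, and zone maps let Redshift skip 1MB blocks whose min/max dates fall outside the filter:

    -- Hypothetical sketch; table and column names are illustrative.
    -- Only the sale_date and amount column blocks are read, and zone
    -- maps skip blocks whose min/max sale_date misses the filter.
    SELECT SUM(amount)
    FROM sales
    WHERE sale_date BETWEEN '2016-01-01' AND '2016-01-31';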

The Dense Storage (DS2) node type, released in June 2015, offers twice the memory and compute power of its predecessor (DS1) at the same storage capacity and cost, paving the way for overall improvements to Amazon Redshift. Since this release, Amazon has worked to improve Amazon Redshift’s throughput by 2X every six months. Additionally, Amazon Redshift has improved vacuuming performance by 10X.

For those working with a wide variety of workloads and a spectrum of query complexities, this is a welcome improvement to the Amazon Redshift system.

Google BigQuery

Similarly, Google BigQuery relies on Dremel and Colossus (the successor to the Google File System) for its storage, compute and memory. These systems are tied together by Google’s Petabit Jupiter Network, which allows every node to communicate with any other node at 10G. One of the major benefits of this architecture is bottleneck-free scaling and concurrency.

In March 2016, Google BigQuery released Capacitor, an updated, state-of-the-art storage format within the data warehouse. Capacitor takes an opinionated approach to storage, allowing background processes to constantly evaluate customers’ usage patterns and automatically optimize datasets to improve performance. The process is entirely automated and transparent to end users, and it allows queries to run faster because the system adapts to each user’s Google BigQuery environment. For further information on Google BigQuery’s throughput, you can read their Medium post.


Concurrency

Every data warehouse has concurrency limitations: a maximum number of queries you can run simultaneously without slowing the generation of interactive reports. In practice, you may come up against concurrency issues when democratizing data access for users who explore data within pre-existing dashboards via a Business Intelligence tool. That, coupled with the warehouse simultaneously having to ingest new data streams for reporting, can tax the overall system.

While each of these activities may seem simple, they can easily generate a large number of queries with varied resource requirements, running up against concurrency limits and creating a lag in report generation.

Ideally, all queries would operate without any contention for resources; in reality, every data warehouse has resource constraints under the hood and thus practical limits on concurrent workload capabilities.

Amazon Redshift

As an Amazon Redshift administrator, you can set the concurrency limit for your Amazon Redshift cluster in the Management Console. While the concurrency limit is 50 parallel queries at any given time, this limit is per cluster, meaning you can launch as many clusters as fit your business. Note also that each cluster supports a maximum of 500 concurrent connections; queries from up to 500 users can be accepted, with up to 50 executing at any given time.

Having said that, Amazon’s documentation recommends only running 15 - 25 queries at a time for optimal throughput.

Additionally, Amazon Redshift allows users to adjust their Workload Management settings (number of slots/queries, queues, memory allocation and timeout) to best address their needs.
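As a minimal sketch of this kind of tuning (queue definitions themselves live in the cluster’s WLM configuration), a session can temporarily claim extra slots from its queue for a heavy query; wlm_query_slot_count is a standard Redshift session setting, while the query itself is illustrative:

    SET wlm_query_slot_count TO 3;   -- borrow 3 of the queue's slots
    SELECT COUNT(*) FROM big_table;  -- run the resource-intensive query
    SET wlm_query_slot_count TO 1;   -- return to the default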

This is an important aspect of Amazon Redshift’s ability to provide interactive reports to its customers. Whether your company is enabling data exploration by allowing business users to run their own queries via an interactive BI tool or you’re fielding ad hoc query requests, it’s crucial to understand Amazon Redshift’s concurrency limitation as this will impact the speed at which you receive results.

Google BigQuery

On the other hand, Google BigQuery isn’t immune to the 50-query concurrency limit. Although not a traditional MPP database, Google BigQuery still uses MPP concepts, only on a much larger scale. It introduces a few innovations that work around MPP limits, in particular concurrency limits:

• Separating compute and storage: Historically, MPP systems couple compute and storage in order to avoid the performance hit of fetching data across the network. However, this limits the system’s ability to scale compute up and down based on the number of queries being processed. Google BigQuery separates compute and storage so that the system as a whole can scale compute to match the current query load, at the cost of potentially increased network traffic for any particular query.

• Using an order of magnitude more hardware: Unlike a traditional MPP system, where administrators must plan the number of compute nodes a cluster should have ahead of time, Google dynamically allocates a varying number of servers. Where Amazon Redshift has a node limit of 128, Google is potentially using thousands of servers to process a query.

The flip side of all that hardware is that Google provides it to all of its users: Google BigQuery is a shared, multi-tenant system, so you share resources with other Google BigQuery users.

Conclusion

At the end of the day, throughput and concurrency impact one another, and both impact the overall performance of a data warehouse. Users all want the same thing: a highly performant data warehouse that will meet their needs and allow everyone to access data.

Enabling everyone to access data generates questions, leads to crucial insights and drives better decision-making that wouldn’t be possible without a data warehouse.

Amazon Redshift is currently used successfully by many companies, such as NASDAQ, SoundCloud, theSkimm and Clever, to provide interactive analytics. Because it is an extremely agile, heavily documented and well-understood system with a large community base, architects and engineers can learn its strengths and weaknesses and maintain it appropriately within their data infrastructure.

Google BigQuery, although used by enterprise-sized companies such as The New York Times, Spotify and Zulily to provide flexible analytics at scale, lacks the robust documentation and community that follow Amazon Redshift, which can make it difficult to resolve issues when they appear. In the past few months, though, Google has been working to improve Google BigQuery, particularly around cost transparency and the move toward Standard SQL.

Operations

Data warehouses are architected to handle a large volume of data. In fact, many companies use data warehouses to store historical data going back at least three years, which is a great practice when it comes to enriching information for a target persona or running product usage analysis.

Provisioning

Data provisioning is the process of making data available in an organized and secure way to users and third-party applications that need access to the data.

Amazon Redshift

According to its documentation, “an Amazon Redshift data warehouse is a collection of computing resources called nodes, which are organized into a group called a cluster. Each cluster runs an Amazon Redshift engine and contains one or more databases.”

In architecting your Amazon Redshift deployment, you must first provision a cluster made up of one or more nodes; how data is spread across those nodes is handled by table distribution styles. Amazon currently offers four node types. To set up Amazon Redshift, you launch a cluster, and Amazon automates the rest of the process. The number of nodes you choose depends on the size of your dataset and your query performance requirements. Many Amazon Redshift users see cluster provisioning and choosing node types as a source of greater flexibility, allowing them to use different node types for different needs for optimal performance. A hypothetical distribution declaration is sketched below.
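As a hypothetical sketch (the table and column names are ours), a distribution style and sort key are declared when the table is created:

    CREATE TABLE sales (
        sale_id     BIGINT,
        customer_id BIGINT,
        sale_date   DATE,
        amount      DECIMAL(10,2)
    )
    DISTSTYLE KEY          -- co-locate rows that share a customer_id
    DISTKEY (customer_id)
    SORTKEY (sale_date);   -- feeds the zone maps discussed under Throughput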

Because Amazon Redshift is architected to meet the needs of its users, Amazon provides an elasticity option for adding and subtracting nodes when necessary. Users may want to resize a cluster when they have more data, or size it down if their business needs and data change. Since Amazon automates the entire resizing process, it takes only a few minutes to complete.

Although a cluster is read-only while resizing, Amazon allows users to run a parallel environment with the same data and continue to run analytics on the replicated cluster without impacting the production cluster or ongoing workload.

Google BigQuery

For Google BigQuery, there are no clusters or nodes to configure in its architecture. Rather, data is provisioned into tables, and queries are run against those tables directly.

According to Google’s blog post, “We have customers currently running queries that scan multiple petabytes of data or tens of trillions of rows using a simple SQL query, without ever having to worry about system provisioning, maintenance, fault- tolerance or performance tuning.”
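As a minimal sketch of that experience, a query against one of Google’s public sample datasets runs without any cluster setup (the dataset name below is a public sample, not from the source):

    -- No provisioning required: query a public sample dataset directly.
    SELECT COUNT(*) AS row_count
    FROM `bigquery-public-data.samples.wikipedia`;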

In September 2016, Google released a flat-rate pricing model giving high-volume enterprise customers a more stable monthly cost for queries, rather than the on-demand model of paying for data processed. Flat-rate pricing starts at a monthly cost of $40,000 with an unlimited number of queries. It’s worth noting that with this pricing model, storage is priced separately.

For many smaller companies, forking over $40,000 per month for unlimited queries may not be an option. While provisioning data may seem easier in Google BigQuery, it comes at the cost of less predictable monthly pricing and the need to hold to a quota threshold to keep the system cost-effective.

Conclusion

If your company is willing to build a data warehouse infrastructure upfront and is able to think proactively about data organization, then Amazon Redshift is the clear choice. With various node types available at huge storage capacities, Amazon Redshift is a robust system used by enterprises and scaling startups alike.

Google BigQuery has taken an entirely different approach to provisioning, forgoing that organization because Google can bring an order of magnitude more hardware to bear than its competitors. This architecture may appeal to companies that want to dive into the deep end first, at the expense of control over cost, future data migration and infrastructure. While this may seem alluring, lean companies should not ignore the cost factor.

Choosing between the two systems may come down to deciding whether you prefer, and are able, to provision data upfront or would rather forgo the process altogether.


Loading

In evaluating a data warehouse, it’s important to consider how data loads from your database into the data warehouse. It’s also critical to consider the speed, accessibility and latency of data once it’s loaded into the data warehouse.

Much as cloud-based data warehouses revolutionized warehousing, data loading has been streamlined by ETL services from companies such as Treasure Data, Segment and Stitch, or by in-house-built ETL. However, there are differences in loading data into Amazon Redshift and Google BigQuery.

Amazon Redshift

If you’re a current AWS customer, Amazon has optimized methods for loading data into Amazon Redshift from the other AWS services you’re already using.

The most optimized way for users to load data is to stage it in Amazon S3 and then use the COPY command to load it into Amazon Redshift in parallel. In addition to Amazon S3 and Elastic MapReduce, you can also load data from DynamoDB, any SSH-enabled host, or via the PostgreSQL wire protocol.

When loading data, Amazon Redshift’s ingest performance continues to scale with the size of your cluster: the COPY command leverages the MPP architecture to read and load data in parallel from files in an Amazon S3 bucket, as in the sketch below.
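A hypothetical sketch of such a load (the table name, bucket path and IAM role are illustrative):

    COPY sales
    FROM 's3://my-bucket/sales/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    FORMAT AS CSV
    GZIP;  -- compressed files under the prefix are read in parallel across nodes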

Additionally, Amazon Redshift has a robust partner ecosystem of ETL (extract, transform and load) tools that push data from your production database into Amazon Redshift.

Google BigQuery

Likewise, if you’re a Google Cloud Platform customer, data loading between Google services is built in, so loading is quite seamless from data source to Google BigQuery. According to Google documentation, users can load data directly from a readable data source or insert individual records using streaming inserts.

An advantage for customers using Google BigQuery is the ability to have federated access to data sources. Meaning, customers can store their data in Google Cloud Storage or Google Drive and query those data sources from within Google BigQuery, without having to load the data into the data warehouse first.

Further, there are a few caveats when using Google Analytics. Once your website’s data grows to a large enough volume, Google limits your access to a sample set of the data, limiting your ability to run performant queries. As a workaround, users can upgrade their Google Analytics account to Premium at a flat-rate annual cost of $150,000; included in this upgrade is an automatic sync from Google Analytics into Google BigQuery.

Conclusion

If you’re already heavily invested in one ecosystem, whether it’s AWS or Google Cloud Platform, it’s most cost-effective to remain a customer of that ecosystem. AWS and Google Cloud Platform are platform businesses that offer a variety of systems and market themselves as consolidating your tools with a single vendor.

Additionally, should you choose to switch data warehouses, note that it will be difficult to get your data out of Amazon Redshift and into Google BigQuery, and vice versa.

Maintenance

With a cloud-based data warehouse, there’s no physical infrastructure to manage, allowing for a streamlined focus on analytics and insights rather than hours of manual maintenance. But, like any system, every data warehouse needs a tune-up from time to time.

Amazon Redshift

As a data warehouse built with MPP concepts, Amazon Redshift requires periodic maintenance to keep the system running fast. However, this maintenance is fully taken on by Amazon Redshift and covers all facets of database management. From a performance perspective, the ability to query, load, export, back up, restore and resize is parallelized for users.

Once in maintenance mode, Amazon Redshift monitors the health of a variety of components and failure conditions within an Availability Zone (AZ) and recovers from them automatically.

Another way Amazon Redshift performs maintenance is through the VACUUM command, which removes rows that are no longer needed within the database and then sorts the remaining data.

Running VACUUM is an optimal operation because it reclaims space and resorts rows. Because Amazon Redshift allows users to DELETE from or UPDATE a table, running VACUUM, coupled with Amazon Redshift’s I/O minimization (only relevant data blocks are fetched), leads to optimal query performance. It’s important to note that running VACUUM is not required, particularly if Amazon Redshift is used in an append-only fashion.
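A hypothetical maintenance sequence (the table name is illustrative):

    DELETE FROM sales
    WHERE sale_date < '2013-01-01';  -- rows are only marked as deleted
    VACUUM FULL sales;               -- reclaim the space and resort the rows
    ANALYZE sales;                   -- refresh statistics after reorganizing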

The VACUUM command is a significant distinction between Amazon Redshift and Google BigQuery. Because Amazon Redshift sorts data to fetch only relevant data blocks, it avoids Google BigQuery’s approach of reading an entire table, which can potentially degrade query performance.

For more information on how Amazon Redshift utilizes the VACUUM command, reference Amazon Redshift’s documentation.

Google BigQuery

As mentioned above, Google has managed to solve many common data warehouse concerns by throwing an order of magnitude more hardware at the existing problems, eliminating them altogether. Unlike Amazon Redshift, Google BigQuery has no VACUUM to run: it is specifically architected without the need for the resource-intensive VACUUM operation that is recommended for Amazon Redshift.

Since Google BigQuery does not require data provisioning, maintenance is much less of an issue; it isn’t required for the system to be performant. The advantage of not requiring maintenance is the flexibility of having your data available at all times, without periodic maintenance windows. However, a potential downside is that users are unable to remove or resort irrelevant data, which can lead to higher costs, since Google BigQuery charges by data processed.

However, Google has implemented ways for users to reduce the amount of data processed (see the sketches after this list):

• Partition their tables by specifying partition date in their queries

• Use wildcard tables to shard their data by an attribute
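Hypothetical sketches of both techniques (the project, dataset and table names are ours, not from the source):

    -- Scan only one week of a date-partitioned table:
    SELECT COUNT(*) AS events
    FROM `my_project.my_dataset.events`
    WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2016-10-01')
                             AND TIMESTAMP('2016-10-07');

    -- Scan only one week of date-sharded tables via a wildcard:
    SELECT COUNT(*) AS events
    FROM `my_project.my_dataset.events_*`
    WHERE _TABLE_SUFFIX BETWEEN '20161001' AND '20161007';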

In a Google blog post titled BigQuery Under the Hood, it states,


“The ultimate value of BigQuery is not in the fact that it gives you incredible computing scale, it’s that you are able to leverage this scale for your everyday SQL queries, without ever so much as thinking about software, virtual machines, networks or disks.”

Keep in mind that, by design, Google BigQuery is append-only: when you plan to update or delete data, you’ll need to truncate the entire table and recreate it with the new data.
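A hypothetical sketch of that rewrite pattern (names are illustrative; the query’s destination is set to overwrite the original table in the job configuration):

    -- "Delete" rows by keeping everything else and overwriting the table:
    SELECT *
    FROM `my_project.my_dataset.events`
    WHERE user_id != 12345;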

Conclusion

While there may be instances where Google BigQuery is arguably faster than Amazon Redshift, it sacrifices organizational structure in favor of requiring no maintenance. Amazon Redshift customers, for their part, have appreciated the ability to optimize cost and performance for their use case.

In terms of maintenance, the key difference between Amazon Redshift and Google BigQuery is the VACUUM command. For Amazon Redshift, running VACUUM frees up deleted space, sorts data blocks and ensures only relevant data is retrieved.

In Google BigQuery, VACUUM is not required. Even though there is no VACUUM option, Google has architected ways for users to reduce the amount of data processed without performing a full read of the table. For many users, this is an advantage because they don’t have to proactively work maintenance into their operations, even in times of low activity.

Security

As both Amazon Redshift and Google BigQuery are petabyte-scale, cloud-based data warehouses, each brings its own approach to end-to-end security for its customers. Data warehouses require a flexible, powerful security infrastructure that scales with requirements. Both Amazon Redshift and Google BigQuery (alongside their parent ecosystems) take security very seriously but handle it in different ways.

Amazon Redshift

As an Amazon Redshift user, you can manage security for your data warehouse in a multitude of ways, including encrypting your workloads end-to-end, protecting access to your cluster, managing access for specific users and leveraging your own hardware security module (HSM), either on-premises or as a fully managed service on AWS. For the purposes of this whitepaper, we’ll focus on a few:

• Virtual Private Cloud (VPC): To protect access to your cluster by using a virtual networking environment, you can launch your cluster in an Amazon VPC.

• SSL connections: To encrypt the connection between your SQL client and your cluster, you can use SSL encryption.

• Cluster encryption: To encrypt data in all your user-created tables, you can enable cluster encryption when you launch the cluster.

Enterprise customers such as NASDAQ, FINRA and NTT Docomo have relied on Amazon Redshift for years and have accumulated data at petabyte scale, in part because of its security and compliance. For a full list of ways to manage security within Amazon Redshift, read the Security Overview documentation.

Google BigQuery

Much like the entire Google Cloud Platform, Google BigQuery also encrypts all data at rest by default. Data encryption is a process that takes readable data as input and transforms it into an output that reveals little to no information about the input.

For the Google Cloud Platform ecosystem, encryption at rest reduces the attack surface and allows systems, like a Business Intelligence tool, to manipulate data for analysis without providing access to content.

For the Google Cloud Platform, “encryption at rest reduces the surface of attack by effectively ‘cutting out’ the lower layers of the hardware and software stack.” For a full overview of Google Cloud Platform’s security and compliance, read their documentation.


Cost

As data becomes more voluminous, accessible and cost-efficient, every business must leverage data to its advantage. Of the data warehouses on the market, Amazon Redshift and Google BigQuery are among the most cost-effective solutions without compromising efficacy.

Amazon Redshift

With more than 10 years of experience, AWS has a large and mature ecosystem offering a full suite of cloud computing services (EMR, RDS, Kinesis, S3, EC2), giving it a head start of years over the Google Cloud Platform ecosystem. If you’re already invested in the AWS ecosystem, it’s cost-effective to continue purchasing its services.

Amazon Redshift offers on-demand pricing (see below) and several Reserved Instance (RI) options discounted up to 75% relative to on-demand pricing. This pricing model gives users the ability to construct their data warehouse according to their business needs.

With on-demand pricing, Amazon Redshift charges per hour per node, which includes both compute and storage. This pricing model is predictable, as users can run as many queries as necessary without being penalized by a high cost. Pricing starts at $0.25 per hour for 160GB of data, and a cluster of 8 dc1.large nodes costs $1,440 for a 30-day month.

Additionally, users can purchase two nodes (users get the leader node for free, so in total there are three nodes) for $360 for a 30-day month.
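Making the arithmetic explicit, using the on-demand rate above:

    $0.25 per node-hour × 24 hours × 30 days = $180 per node per month
    $180 × 8 nodes = $1,440 per month
    $180 × 2 nodes = $360 per month (the leader node is free)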

However, it’s important to note that because Amazon Redshift couples compute and storage within each node type (SSD on dense compute nodes, HDD on dense storage nodes), scaling one requires scaling the other, which contributes to overall cost.

Google BigQuery

Because Google BigQuery separates compute and storage, it offers an extremely flexible pay-as-you-use pricing model: storage is charged by GB used, starting at $0.02 per GB per month. This has allowed companies with smaller data sets to experiment with a data warehouse without running up a large purchase order. However, it’s critical to note that running queries is a separate cost from storing data.

For Google BigQuery, query pricing is the cost of running your SQL commands and user-defined functions, and it varies by the number of bytes processed. Query pricing has three tiers: the first 1TB each month is free, the on-demand tier costs $5 per TB thereafter, and flat-rate pricing is $40,000 per month for 2,000 dedicated slots.
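As a worked example of the on-demand tier above: queries that scan 10TB in a month incur no charge for the first 1TB, then 9TB × $5/TB = $45 in query charges, with storage billed separately.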

Though Google has been transparent that flat-rate pricing is aimed at the enterprise, Google BigQuery does offer a handful of free capabilities, including loading data, copying data, exporting data and metadata operations.


A downside to the pay-as-you-use model is that pricing is less transparent and hard to predict for long-term budgeting. Users have reported that complex analytics and query errors can result in unexpected costs. However, BigQuery now provides a Cost Control feature that lets users set a cap on daily costs. As the Google BigQuery administrator, you’ll have to proactively manage the daily query quota; once the limit has been reached, you’ll have to communicate this to your team and cease running queries until the quota resets.

Conclusion

For scaling startups to large enterprise companies, cost can greatly fluctuate. For companies that want more predictable pricing, Amazon Redshift is ideal, with pricing starting at $0.25 per hour and up to a 75% discount through Reserved Instances.

On the other hand, Google BigQuery now has predictable flat-rate pricing of $40,000 per month. However, this is targeted at enterprise companies; smaller companies may not be able to rationalize the high cost. Under on-demand pricing, calculating Google BigQuery’s monthly cost requires stricter monitoring of queries and storage.

Use Cases

Now that we’ve covered seven critical differentiators between Amazon Redshift and Google BigQuery, let’s look at how companies use each data warehouse in their daily operations.

How Everlane Uses Amazon Redshift for Daily Projected Revenue

Everlane is an online retailer of luxury clothing that manufactures its own designs, with a mission of being transparent about costs and production. As an online retailer, Everlane collects a massive amount of data from its customers and houses it in Amazon Redshift.

One of the most compelling analyses that Everlane performs is reporting on Daily Projected Revenue via data surfaced through Amazon Redshift.

Everlane performs this analysis by blending several layers, pulling from Amazon Redshift (for historical data) and a MySQL database (for real-time data). Everlane stores the past 50 days of data in its Amazon Redshift cluster(s), down to the minute and hour. This level of organization and proactive thought about node organization is critical for Everlane when querying data for its daily projections, which is only possible via Amazon Redshift.

These layers then feed the analysis of how much the company is projected to make in a single day, which in turn informs overall net new revenue. Amazon Redshift makes it incredibly easy for Everlane to access historical data, allowing them to hypothesize about their customers and query as much as needed without any cost penalties.

Amazon Redshift is used by leading companies, both large and small, such as Yelp, The Weather Channel, theSkimm, Johnson & Johnson and Clever. Click here for a full list of customers and to read their success stories.

How Reddit Uses Google BigQuery for Personalized Sales Pitches

Reddit is the largest open source online community that allows users to contribute by adding new content, commenting and upvoting discussions. As a fast-growing community, Reddit needed a scalable solution and chose Google BigQuery.

Reddit has a Google BigQuery data set for every comment made on the site from January 2014 to the present. According to Reddit traffic stats, the site receives on average one million unique visitors a day, which adds up to a massive amount of data.

With this data set, the data team can run a sentiment analysis of the comments (using Cornell’s Sentiment Analysis Framework).

This analysis is done by querying the data set and counting the number of positive and negative words that appear around a specific keyword. From this report, Reddit can see the number of keyword mentions, the number of comments made and the number of subreddits where the keyword is mentioned.
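A minimal sketch of a keyword report in this spirit (the dataset and table names are illustrative, not Reddit’s, and the real sentiment word lists would differ):

    SELECT
      COUNT(*) AS comment_mentions,
      COUNT(DISTINCT subreddit) AS subreddit_count
    FROM `my_project.reddit_comments.2016_08`
    WHERE LOWER(body) LIKE '%keyword%';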

Reddit is able to house its unstructured comment data in Google BigQuery and run this type of analysis because Google has separated compute from storage and has an order of magnitude more hardware, enabling it to run such queries for clients.

This kind of analysis is useful when the Sales team pitches to brands; between the queries the data team runs and the dashboards created for the Sales team, that amounts to a lot of data and queries.

Google BigQuery is used by leading companies, both large and small, such as The New York Times, Spotify, Coca-Cola and WePay. Click here for a full list of customers and to read their success stories.

Conclusion

While both data warehouses are optimized for analytics at a large volume, there are enough differences between the two for you to make an informed decision when evaluating a system.

Google BigQuery is rapidly growing in popularity with scaling companies, though it still has hurdles around pricing transparency. It excels at getting your data warehouse off the ground and running quickly, without having to provision anything.

Amazon Redshift offers a more mature, agile and standard data warehouse with all the high-performing capabilities of a cloud-based data warehouse, along with extensive documentation, cost-effectiveness, flexibility and use cases.

At the end of the day, it’s also important to consider that if your company is already a heavy user of AWS or the Google Cloud Platform, it’s wise to stay within that ecosystem, as both systems share the same end goal: interactive reporting and insights from data.

In our next whitepaper, we’ll present in-depth benchmarks comparing Amazon Redshift and Google BigQuery, particularly around performance, operations and cost.

About Chartio

Chartio is a powerful, cloud-based data exploration tool that allows anyone to explore their data through Chartio’s SQL and Interactive Mode. Now anyone within an organization has the power to explore data to make more informed decisions instantaneously.

We’re thrilled to partner with both Amazon Web Services and Google Cloud Platform, allowing customers to combine and query their data to perform analyses that drive their business forward.

References

“Amazon Redshift - Up to 2X Throughput and 10X Vacuuming Performance Improvements.” AWS Blog. N.p., 24 May 2016. Web. 17 Oct. 2016. (Source)

“Amazon Redshift Adds New Dense Storage (DS2) Instances and Reserved Node Payment Options.” AWS. N.p., Web. 17 Oct. 2016. (Source)

“Amazon Redshift Clusters.” AWS Documentation. N.p., Web. 10 Aug. 2016. (Source)

“Amazon Redshift Database Encryption.” AWS Documentation. N.p., Web. 10 Oct. 2016. (Source)

“Amazon Redshift Security Overview.” AWS Documentation. N.p., Web. 10 Oct. 2016. (Source)

“Askreddit.” Traffic Stats. N.p., Web. 18 Aug. 2016. (Source)

“AWS Service Limits.” AWS Documentation. N.p., Web. 10 Aug. 2016. (Source)

“BigQuery under the Hood - Google Cloud Big Data and Machine Learning Blog - Google Cloud Platform.” N.p., 27 Jan. 2015. Web. 12 Aug. 2016. (Source)

“Cost Controls - BigQuery Documentation - Google Cloud Platform.” Google Developers. N.p., Web. 11 Aug. 2016. (Source)

“Creating and Querying Federated Data Sources - BigQuery Documentation - Google Cloud Platform.” Google Developers. N.p., Web. 31 Oct. 2016. (Source)

“Defining Query Queues.” AWS Documentation. N.p., Web. 10 Aug. 2016. (Source)

“Inside Capacitor, BigQuery’s next-generation columnar storage format.” Google Cloud Platform Blog. N.p., 26 Apr. 2016. Web. 12 Oct. 2016. (Source)

“Loading Data - BigQuery Documentation - Google Cloud Platform.” Google Developers. N.p., Web. 11 Aug. 2016. (Source)

Pang, Bo, and Lillian Lee. Opinion Mining and Sentiment Analysis. N.p.: Cornell University, 2008. PDF. (Source)

“Pricing - BigQuery Documentation - Google Cloud Platform.” Google Developers. N.p., Web. 11 Aug. 2016. (Source)

“Pulling Back the Curtain on Google’s Network Infrastructure.” Google Research Blog. N.p., Web. 11 Oct. 2016. (Source)

“Quota Policy - BigQuery Documentation - Google Cloud Platform.” Google Developers. N.p., Web. 11 Aug. 2016. (Source)

“Take Your Big Data to New Places with Google BigQuery.” Google Cloud Platform Blog. N.p., 17 Apr. 2015. Web. 12 Aug. 2016. (Source)


Tereshko, Tino. “15 Awesome Things You Probably Didn’t Know about Google BigQuery.” Medium. N.p., 21 Oct. 2016. Web. 2 Nov. 2016. (Source)

Tigani, Jordan, and Siddartha Naidu. Google BigQuery Analytics. N.p.: Wiley, 2014. Print.

“Vacuum.” AWS Documentation. N.p., Web. 10 Oct. 2016. (Source)

“What is BigQuery? BigQuery - Google Cloud Platform.” Google Developers. N.p., Web. 11 Aug. 2016. (Source)

“Workload Management.” AWS Documentation. N.p., Web. 10 Oct. 2016. (Source)

www.chartio.com - ©2016 Chartio, Inc. All rights reserved.