Amazon Redshift & DynamoDB
Michael Hanisch
Erez Hadas-Sonnenschein, clipkit GmbH
Witali Stohler, clipkit GmbH
2014-05-15

© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Amazon Redshift

Fast, simple, petabyte-scale data warehousing for less than $1,000/TB/Year

A fully managed data warehouse service
• Massively parallel relational data warehouse
• Takes care of cluster management and distribution of your data
• Columnar data store with variable compression
• Optimized for complex queries across many large tables
• Use standard SQL & standard BI tools

Amazon DynamoDB

A fully managed, fast key-value store
• Fast, predictable performance
• Simple and fast to deploy
• Easy to scale as you go, up to millions of IOPS
• Pay only for what you use: read/write IOPS + storage
• Data is automatically replicated across data centers

Amazon DynamoDB vs. Amazon Redshift

Amazon DynamoDB:
• Fast insert & update
• Limited query capability (single table only)
• NoSQL

Amazon Redshift:
• Fast queries
• Flexible queries (JOINs, aggregation functions, …)
• SQL

Queries in Amazon DynamoDB
• Query or BatchGetItem APIs retrieve items
• Scan & filter to comb through a whole table
• You have to join tables in your own code!
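Since DynamoDB offers no JOIN, combining items from two tables happens in application code, for example with a hash join over the item lists returned by two Query/BatchGetItem calls. A minimal sketch in plain Python (the table contents and attribute names are made up for illustration):

```python
# Join items from two DynamoDB tables in application code.
# The lists below are hypothetical stand-ins for the item lists
# returned by two separate Query / BatchGetItem calls.

videos = [
    {"video_id": "v1", "publisher_id": "p1", "views": 120},
    {"video_id": "v2", "publisher_id": "p2", "views": 45},
]
publishers = [
    {"publisher_id": "p1", "name": "clipkit"},
    {"publisher_id": "p2", "name": "example"},
]

def hash_join(left, right, key):
    """Inner-join two lists of items on a shared attribute."""
    index = {item[key]: item for item in right}   # build a hash index on the right side
    return [
        {**l, **index[l[key]]}                    # merge each matching pair of items
        for l in left if l[key] in index
    ]

joined = hash_join(videos, publishers, "publisher_id")
```

This is exactly the work a relational database would do for you; at any real scale it also means paying for all the read throughput the join consumes.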

Queries in Amazon DynamoDB (2)
• Apache Hive on Amazon EMR can access data in DynamoDB
• Run HiveQL queries for bulk processing
• Can integrate data in HDFS, …

HiveQL queries on Amazon EMR

Queries in Amazon DynamoDB (3)
• Import data into Amazon Redshift
• Use SQL queries, BI tools, etc.
• Powerful analytics and aggregation functions

Importing Data into Amazon Redshift
TMTOWTDI (there's more than one way to do it)

Query & Insert

#1 Query / BatchGetItem against Amazon DynamoDB
#2 Retrieve items
#3 INSERT INTO … (…) on Amazon Redshift

Query & Insert

The Good:
• Full control over queries
• Decide which items you want to move to Redshift
• Process data on the way

The Bad:
• Slow
• Inefficient on the Redshift side of things
• Does not scale well

The COPY Command

#1 COPY … FROM issued to Amazon Redshift
#2 Redshift politely asks Amazon DynamoDB for a table
#3 DynamoDB returns the whole table

The COPY Command

#1 COPY … FROM issued to Amazon Redshift
#2 Redshift runs parallel scans against Amazon DynamoDB
#3 DynamoDB returns items

The COPY Command
• COPY a single table at a time
• From one Amazon DynamoDB table into one Amazon Redshift table
• Fast – executed in parallel on all data nodes in the Amazon Redshift cluster
• Can be limited to use a certain percentage of provisioned throughput on the DynamoDB table

The COPY Command
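The throughput cap works on read capacity units (RCUs). A rough sketch of the arithmetic behind a READRATIO-style limit, assuming the classic pricing model of 1 RCU per 4 KB read (the function name and the numbers are illustrative, not COPY's actual implementation):

```python
# Rough sketch: translate a READRATIO-style percentage into a scan rate.
# Assumes 1 read capacity unit (RCU) covers a 4 KB read; hypothetical helper.

def allowed_bytes_per_second(provisioned_rcu, readratio_percent):
    """Bytes/s a scan may consume when limited to a share of the table's RCUs."""
    usable_rcu = provisioned_rcu * readratio_percent / 100.0
    return usable_rcu * 4 * 1024          # each RCU covers one 4 KB read per second

# A table with 1,000 provisioned RCUs, copied with READRATIO 10,
# may be scanned at 100 RCUs, i.e. 400 KB per second.
rate = allowed_bytes_per_second(1000, 10)
```

The point of the cap: a long-running COPY should not starve the live application traffic that shares the same provisioned throughput.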

COPY my_table (col1, col2, …)        -- target table name is a placeholder
FROM 'dynamodb://my_source_table'    -- source DynamoDB table name is a placeholder
CREDENTIALS 'aws_access_key_id=…;aws_secret_access_key=…'
READRATIO 10   -- use 10% of available read capacity
COMPROWS 0     -- how many rows to sample to determine compression
[…other options…]

The COPY Command
• Attributes are mapped to columns by name
• Case of column names is ignored
• Attributes that do not map are ignored
• Missing attributes are stored as NULL or empty values
• Only works for STRING and NUMBER attributes

The COPY Command
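Those mapping rules can be illustrated with a small simulation in plain Python (the function and the item layout are hypothetical stand-ins, not COPY's actual implementation): names match case-insensitively, unmatched attributes are dropped, missing attributes become NULL, and only STRING ("S") and NUMBER ("N") attributes are considered.

```python
# Simulate COPY's attribute-to-column mapping rules (a sketch only).

def map_item_to_row(item, columns):
    """item: DynamoDB-style {attr: (type, value)}; columns: Redshift column names."""
    by_lower = {attr.lower(): (t, v) for attr, (t, v) in item.items()
                if t in ("S", "N")}           # other types are not supported by COPY
    # Columns match case-insensitively; missing attributes map to None (NULL).
    return {col: by_lower.get(col.lower(), (None, None))[1] for col in columns}

item = {"VideoId": ("S", "v1"), "views": ("N", "120"), "tags": ("SS", ["a"])}
row = map_item_to_row(item, ["videoid", "views", "country"])
# "country" is missing from the item -> NULL; "tags" is a string set -> ignored
```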

The Good:
• Easy to use
• Fast
• Efficient use of resources
• Scales linearly with cluster size
• Only uses a certain percentage of read throughput

The Bad:
• Whole tables only
• No processing in between
• Can only copy from DynamoDB in the same region
• Only works with STRING and NUMBER types

Query & Insert at Scale

#1 Query / BatchGetItem in parallel
#2 Retrieve items
#3 INSERT INTO … (…) in parallel

Query & Insert at Scale

#1 Query / BatchGetItem in parallel, from Amazon EMR
#2 Retrieve items from Amazon DynamoDB
#3 INSERT INTO … (…) into Amazon Redshift, in parallel

Query & Import using Amazon EMR

#1 Query / BatchGetItem in parallel, from Amazon EMR
#2 Retrieve items from Amazon DynamoDB
#3 Export to file(s) on Amazon S3
#4 COPY … FROM 's3://…'
#5 Amazon Redshift retrieves the files from S3

Query & Import using Amazon EMR

#1 Query / BatchGetItem in parallel, from Amazon EMR
#2 Retrieve items from Amazon DynamoDB
#3 COPY … FROM 'emr://…'
#4 Amazon Redshift retrieves the files from HDFS

Query & Import using Amazon EMR

The Good:
• Decide which items you want to move to Redshift
• Full control over queries
• Process data on the way
• Scales well
• Integrates with other data sources easily

The Bad:
• Additional complexity
• Additional cost (for EMR)
• Slower than direct COPY from Amazon DynamoDB

Please welcome

Erez Hadas-Sonnenschein, Sr. Product Manager
Witali Stohler, Data Warehouse & BI Specialist
clipkit GmbH

Video Syndication – The Possibilities

Content – Partner Overview

News, Sports, Cars/motor, Business/finances, Music, Gaming, Cinema, Cooking/food, Lifestyle/fashion, Traveling, Computer/mobile, Fitness/wellness, Knowledge/hobby, Entertainment

clipkit Player – Analytics (Metrics)

Player UI metrics: Full Screen, Category, Playlist Pos., Play / Pause, Progress Pos., Mute / Unmute, Volume

clipkit Player – Analytics (Metrics)

• Location (Country, City)
• Language
• Browser
• Operating System
• Video Id
• Publisher URL
• Etc.

First Implementation (Expensive and Slow)

• Designed in the early days of the product
• Not sized for this amount of data
• Slow copy process from S3 to the DB (old PHP application architecture)
• Fixed EC2 price (expensive to support peak hours)
• PostgreSQL scalability limitations
• Sometimes the copy process was so slow that the delay was ~3 days

Analytics / Metrics (Requests Graph)

Analytics / Metrics (Numbers)
• ~6,000,000 new entries per day
• ~1,000 requests per second (peak hours)
• ~25 requests per second (off-peak hours)
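The peak and off-peak rates above are what produce the intraday growth figure quoted on the following slide; a quick arithmetic check:

```python
# Peak vs. off-peak request rates from the slide above.
peak_rps, off_peak_rps = 1000, 25

# Growth during the day, expressed as a percentage of the off-peak rate:
# 1,000 / 25 = 40x, i.e. 4,000%.
growth_percent = peak_rps / off_peak_rps * 100
```

A 40x daily swing is exactly the load shape that makes fixed-size, fixed-price infrastructure expensive: it must be provisioned for the peak but sits mostly idle off-peak.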

4,000% request growth during the day

Second Implementation (Expensive and Slow)

• Inserting into only one (big) table
• The COPY command only works for whole tables
• The minimum delay was one day
• Our solution had to increase the provisioned throughput, and that was expensive

NO REAL-TIME DATA

Third Implementation (Cheap and Fast)

Third Implementation – DB

• Java SDK AmazonDynamoDBAsyncClient ("fire and go")
• Easy to create and delete tables
• Write latency ~5 ms
• Throughput auto-scales with Dynamic DynamoDB

• One table per day
• Continuous iteration and copy to Redshift
• We just pay for what we use

Third Implementation – Redshift
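The one-table-per-day pattern can be sketched with a date-based naming scheme (the `metrics_` prefix and helper below are hypothetical; the deck does not state clipkit's actual table names): writers always target today's table, while a background job copies yesterday's table into Redshift and then deletes it.

```python
from datetime import date, timedelta

def table_for(day):
    """Name of the DynamoDB table holding one day's metrics (hypothetical scheme)."""
    return "metrics_" + day.strftime("%Y_%m_%d")

today = date(2014, 5, 15)
write_table = table_for(today)                           # current writes land here
done_table = table_for(today - timedelta(days=1))        # COPY to Redshift, then delete
```

Deleting a whole table is far cheaper than deleting millions of individual items, and it keeps the COPY-whole-tables-only limitation from being a problem: each daily table is copied exactly once, in full.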

• Standard PostgreSQL JDBC
• Fully managed by Amazon
• Automated backups and fast restores

• ~7,000 items inserted per second
• Queries in less than 2 seconds against > 1 billion entries
• Real-time data available (maximum 1 minute delay)

Third Implementation – Conclusions
• Java web application
  – Auto-scales (off-peak: 1 small instance)
• DynamoDB
  – One table per day (deleted after it has been copied)
  – Auto-scales
  – ~5 ms PutItem latency
• Redshift
  – Inserts ~7,000 items per second
  – Fully managed

Thank You!