Amazon Redshift & DynamoDB
Michael Hanisch
Erez Hadas-Sonnenschein, clipkit GmbH
Witali Stohler, clipkit GmbH
2014-05-15

© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Amazon Redshift

Fast, simple, petabyte-scale data warehousing for less than $1,000/TB/Year

A fully managed data warehouse service
• Massively parallel relational data warehouse
• Takes care of cluster management and distribution of your data
• Columnar data store with variable compression
• Optimized for complex queries across many large tables
• Use standard SQL & standard BI tools

Amazon DynamoDB

A fully managed, fast key-value store
• Fast, predictable performance
• Simple and fast to deploy
• Easy to scale as you go, up to millions of IOPS
• Pay only for what you use: read/write IOPS + storage
• Data is automatically replicated across data centers

Amazon DynamoDB vs. Amazon Redshift

Amazon DynamoDB:
• Fast insert & update
• Limited query capability (single table only)
• NoSQL

Amazon Redshift:
• Fast queries
• Flexible queries (JOINs, aggregation functions, …)
• SQL

Queries in Amazon DynamoDB
• Query or BatchGetItem APIs retrieve items
• Scan & filter to comb through a whole table
• You have to join tables in your own code!
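Since DynamoDB offers no JOIN, combining items from two tables happens in application code, for example with a hash join over the item lists returned by two Query/BatchGetItem calls. A minimal sketch in plain Python (the table contents and attribute names are made up for illustration):

```python
# Join items from two DynamoDB tables in application code.
# The lists below are hypothetical stand-ins for the item lists
# returned by two separate Query / BatchGetItem calls.

videos = [
    {"video_id": "v1", "publisher_id": "p1", "views": 120},
    {"video_id": "v2", "publisher_id": "p2", "views": 45},
]
publishers = [
    {"publisher_id": "p1", "name": "clipkit"},
    {"publisher_id": "p2", "name": "example"},
]

def hash_join(left, right, key):
    """Inner-join two lists of items on a shared attribute."""
    index = {item[key]: item for item in right}   # build a hash index on the right side
    return [
        {**l, **index[l[key]]}                    # merge each matching pair of items
        for l in left if l[key] in index
    ]

joined = hash_join(videos, publishers, "publisher_id")
```

This is exactly the work a relational database would do for you; at any real scale it also means paying for all the read throughput the join consumes.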

Queries in Amazon DynamoDB (2)
• Apache Hive on Amazon EMR can access data in DynamoDB
• Run HiveQL queries for bulk processing
• Can integrate data in HDFS, …

HiveQL queries on Amazon EMR

Queries in Amazon DynamoDB (3)
• Import data into Amazon Redshift
• Use SQL queries, BI tools, etc.
• Powerful analytics and aggregation functions

Importing Data into Amazon Redshift
TMTOWTDI (there's more than one way to do it)

Query & Insert

#1 Query / BatchGetItem against Amazon DynamoDB
#2 Retrieve items
#3 INSERT INTO … (…) on Amazon Redshift

Query & Insert

The Good:
• Full control over queries
• Decide which items you want to move to Redshift
• Process data on the way

The Bad:
• Slow
• Inefficient on the Redshift side of things
• Does not scale well

The COPY Command

#1 COPY … FROM issued to Amazon Redshift
#2 Redshift politely asks Amazon DynamoDB for a table
#3 DynamoDB returns the whole table

The COPY Command

#1 COPY … FROM issued to Amazon Redshift
#2 Redshift runs parallel scans against Amazon DynamoDB
#3 DynamoDB returns items

The COPY Command
• COPY a single table at a time
• From one Amazon DynamoDB table into one Amazon Redshift table
• Fast – executed in parallel on all data nodes in the Amazon Redshift cluster
• Can be limited to use a certain percentage of provisioned throughput on the DynamoDB table

The COPY Command
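The throughput cap works on read capacity units (RCUs). A rough sketch of the arithmetic behind a READRATIO-style limit, assuming the classic pricing model of 1 RCU per 4 KB read (the function name and the numbers are illustrative, not COPY's actual implementation):

```python
# Rough sketch: translate a READRATIO-style percentage into a scan rate.
# Assumes 1 read capacity unit (RCU) covers a 4 KB read; hypothetical helper.

def allowed_bytes_per_second(provisioned_rcu, readratio_percent):
    """Bytes/s a scan may consume when limited to a share of the table's RCUs."""
    usable_rcu = provisioned_rcu * readratio_percent / 100.0
    return usable_rcu * 4 * 1024          # each RCU covers one 4 KB read per second

# A table with 1,000 provisioned RCUs, copied with READRATIO 10,
# may be scanned at 100 RCUs, i.e. 400 KB per second.
rate = allowed_bytes_per_second(1000, 10)
```

The point of the cap: a long-running COPY should not starve the live application traffic that shares the same provisioned throughput.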

COPY my_table (col1, col2, …)        -- target table name is a placeholder
FROM 'dynamodb://my_source_table'    -- source DynamoDB table name is a placeholder
CREDENTIALS 'aws_access_key_id=…;aws_secret_access_key=…'
READRATIO 10   -- use 10% of available read capacity
COMPROWS 0     -- how many rows to sample to determine compression
[…other options…]

The COPY Command
• Attributes are mapped to columns by name
• Case of column names is ignored
• Attributes that do not map are ignored
• Missing attributes are stored as NULL or empty values
• Only works for STRING and NUMBER attributes

The COPY Command
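Those mapping rules can be illustrated with a small simulation in plain Python (the function and the item layout are hypothetical stand-ins, not COPY's actual implementation): names match case-insensitively, unmatched attributes are dropped, missing attributes become NULL, and only STRING ("S") and NUMBER ("N") attributes are considered.

```python
# Simulate COPY's attribute-to-column mapping rules (a sketch only).

def map_item_to_row(item, columns):
    """item: DynamoDB-style {attr: (type, value)}; columns: Redshift column names."""
    by_lower = {attr.lower(): (t, v) for attr, (t, v) in item.items()
                if t in ("S", "N")}           # other types are not supported by COPY
    # Columns match case-insensitively; missing attributes map to None (NULL).
    return {col: by_lower.get(col.lower(), (None, None))[1] for col in columns}

item = {"VideoId": ("S", "v1"), "views": ("N", "120"), "tags": ("SS", ["a"])}
row = map_item_to_row(item, ["videoid", "views", "country"])
# "country" is missing from the item -> NULL; "tags" is a string set -> ignored
```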

The Good:
• Easy to use
• Fast
• Efficient use of resources
• Scales linearly with cluster size
• Only uses a certain percentage of read throughput

The Bad:
• Whole tables only
• No processing in between
• Can only copy from DynamoDB in the same region
• Only works with STRING and NUMBER types

Query & Insert at Scale

#1 Query / BatchGetItem in parallel
#2 Retrieve items
#3 INSERT INTO … (…) in parallel

Query & Insert at Scale

#1 Query / BatchGetItem in parallel, from Amazon EMR
#2 Retrieve items from Amazon DynamoDB
#3 INSERT INTO … (…) into Amazon Redshift, in parallel

Query & Import using Amazon EMR

#1 Query / BatchGetItem in parallel, from Amazon EMR
#2 Retrieve items from Amazon DynamoDB
#3 Export to file(s) on Amazon S3
#4 COPY … FROM 's3://…'
#5 Amazon Redshift retrieves the files from S3

Query & Import using Amazon EMR

#1 Query / BatchGetItem in parallel, from Amazon EMR
#2 Retrieve items from Amazon DynamoDB
#3 COPY … FROM 'emr://…'
#4 Amazon Redshift retrieves the files from HDFS

Query & Import using Amazon EMR

The Good:
• Decide which items you want to move to Redshift
• Full control over queries
• Process data on the way
• Scales well
• Integrates with other data sources easily

The Bad:
• Additional complexity
• Additional cost (for EMR)
• Slower than direct COPY from Amazon DynamoDB

Please welcome

Erez Hadas-Sonnenschein, Sr. Product Manager
Witali Stohler, Data Warehouse & BI Specialist
clipkit GmbH

Video Syndication – The Possibilities

Content – Partner Overview

News, Sports, Cars/motor, Business/finances, Music, Gaming, Cinema, Cooking/food, Lifestyle/fashion, Traveling, Computer/mobile, Fitness/wellness, Knowledge/hobby, Entertainment

clipkit Player – Analytics (Metrics)

Player UI metrics: Full Screen, Category, Playlist Pos., Play / Pause, Progress Pos., Mute / Unmute, Volume

clipkit Player – Analytics (Metrics)

• Location (Country, City)
• Language
• Browser
• Operating System
• Video Id
• Publisher URL
• Etc.

First Implementation (Expensive and Slow)

• Designed in the early days of the product
• Not sized for this amount of data
• Slow copy process from S3 to the DB (old PHP application architecture)
• Fixed EC2 price (expensive to support peak hours)
• PostgreSQL scalability limitations
• Sometimes the copy process was so slow that the delay was ~3 days

Analytics / Metrics (Requests Graph)

Analytics / Metrics (Numbers)
• ~6,000,000 new entries per day
• ~1,000 requests per second (peak hours)
• ~25 requests per second (off-peak hours)
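The peak and off-peak rates above are what produce the intraday growth figure quoted on the following slide; a quick arithmetic check:

```python
# Peak vs. off-peak request rates from the slide above.
peak_rps, off_peak_rps = 1000, 25

# Growth during the day, expressed as a percentage of the off-peak rate:
# 1,000 / 25 = 40x, i.e. 4,000%.
growth_percent = peak_rps / off_peak_rps * 100
```

A 40x daily swing is exactly the load shape that makes fixed-size, fixed-price infrastructure expensive: it must be provisioned for the peak but sits mostly idle off-peak.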

4,000% request growth during the day

Second Implementation (Expensive and Slow)

• Inserting into only one (big) table
• The COPY command only works for whole tables
• The minimum delay was one day
• Our solution had to increase the provisioned throughput, and that was expensive

NO REAL-TIME DATA

Third Implementation (Cheap and Fast)

Third Implementation – DB

• Java SDK AmazonDynamoDBAsyncClient ("fire and go")
• Easy to create and delete tables
• Write latency ~5 ms
• Throughput auto-scales with Dynamic DynamoDB

• One table per day
• Continuous iteration and copy to Redshift
• We just pay for what we use

Third Implementation – Redshift
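The one-table-per-day pattern can be sketched with a date-based naming scheme (the `metrics_` prefix and helper below are hypothetical; the deck does not state clipkit's actual table names): writers always target today's table, while a background job copies yesterday's table into Redshift and then deletes it.

```python
from datetime import date, timedelta

def table_for(day):
    """Name of the DynamoDB table holding one day's metrics (hypothetical scheme)."""
    return "metrics_" + day.strftime("%Y_%m_%d")

today = date(2014, 5, 15)
write_table = table_for(today)                           # current writes land here
done_table = table_for(today - timedelta(days=1))        # COPY to Redshift, then delete
```

Deleting a whole table is far cheaper than deleting millions of individual items, and it keeps the COPY-whole-tables-only limitation from being a problem: each daily table is copied exactly once, in full.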

• Standard PostgreSQL JDBC
• Fully managed by Amazon
• Automated backups and fast restores

• ~7,000 items inserted per second
• Queries in less than 2 seconds against > 1 billion entries
• Real-time data available (maximum 1 minute delay)

Third Implementation – Conclusions
• Java web application
  – Auto-scales (off-peak: 1 small instance)
• DynamoDB
  – One table per day (deleted after it has been copied)
  – Auto-scales
  – ~5 ms PutItem latency
• Redshift
  – Inserts ~7,000 items per second
  – Fully managed

Thank You!