Amazon Redshift & Amazon DynamoDB

Michael Hanisch, Amazon Web Services
Erez Hadas-Sonnenschein, clipkit GmbH
Witali Stohler, clipkit GmbH
2014-05-15
© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Amazon Redshift
Fast, simple, petabyte-scale data warehousing for less than $1,000/TB/year
A fully managed data warehouse service
• Massively parallel relational data warehouse
• Takes care of cluster management and distribution of your data
• Columnar data store with variable compression
• Optimized for complex queries across many large tables
• Use standard SQL & standard BI tools

Amazon DynamoDB
A fully managed, fast key-value store
• Fast, predictable performance
• Simple and fast to deploy
• Easy to scale as you go, up to millions of IOPS
• Pay only for what you use: read/write IOPS + storage
• Data is automatically replicated across data centers

Amazon DynamoDB vs. Amazon Redshift
Amazon DynamoDB:
• Fast insert & update
• Limited query capability (single table only)
• NoSQL database
Amazon Redshift:
• Fast queries
• Flexible queries (JOINs, aggregation functions, …)
• SQL

Queries in Amazon DynamoDB
• Query or BatchGetItem APIs retrieve items
• Scan & filter to comb through a whole table
• You have to join tables in your own code! (see the sketch below)
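A minimal sketch of that client-side join pattern using the Python SDK (boto3); the table names (player_events, videos), key schema, and attributes are illustrative assumptions, not details from the talk.

    import boto3
    from boto3.dynamodb.conditions import Key

    dynamodb = boto3.resource("dynamodb", region_name="eu-west-1")
    events = dynamodb.Table("player_events")   # hypothetical table
    videos = dynamodb.Table("videos")          # hypothetical table

    # 1. Query is limited to a single table and its key schema (pagination omitted).
    resp = events.query(KeyConditionExpression=Key("event_date").eq("2014-05-15"))
    event_items = resp["Items"]

    # 2. Fetch the referenced video items with BatchGetItem (max 100 keys per call).
    video_ids = {e["video_id"] for e in event_items}
    batch = dynamodb.batch_get_item(
        RequestItems={"videos": {"Keys": [{"video_id": v} for v in video_ids]}})
    video_by_id = {v["video_id"]: v for v in batch["Responses"]["videos"]}

    # 3. The "join" itself happens in application code, not in DynamoDB.
    joined = [dict(e, title=video_by_id[e["video_id"]]["title"]) for e in event_items]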
Queries in Amazon DynamoDB (2)
• Apache Hive on Amazon EMR can access data in DynamoDB
• Run HiveQL queries for bulk processing
• Can integrate data in HDFS, Amazon S3, …

Queries in Amazon DynamoDB (3)
• Import data into Amazon Redshift
• Use SQL queries, BI tools, etc.
• Powerful analytics and aggregation functions

Importing Data into Amazon Redshift
TMTOWTDI – there's more than one way to do it

Query & Insert
#1 Query / BatchGetItem against Amazon DynamoDB
#2 Retrieve items
#3 INSERT … INTO (…) on Amazon Redshift

Query & Insert
The Good:
• Full control over queries
• Decide which items you want to move to Redshift
• Process data on the way
The Bad:
• Slow
• Inefficient on the Redshift side of things
• Does not scale well

The COPY Command
#1 COPY FROM … is issued on the Amazon Redshift cluster
#2 Redshift reads the Amazon DynamoDB table with parallel Scans
#3 The returned items are loaded into the Redshift table

The COPY Command
• COPY a single table at a time
• From one Amazon DynamoDB table into one Amazon Redshift table
• Fast – executed in parallel on all data nodes in the Amazon Redshift cluster
• Can be limited to use a certain percentage of provisioned throughput on the DynamoDB table

The COPY Command

    COPY <table_name> (col1, col2, …)
    FROM 'dynamodb://<table_name2>'
    CREDENTIALS 'aws_access_key_id=…;aws_secret_access_key=…'
    READRATIO 10   -- use 10% of available read capacity
    COMPROWS 0     -- how many rows to read to determine compression
    […other options…]

The COPY Command
• Attributes are mapped to columns by name
• Case of column names is ignored
• Attributes that do not map are ignored
• Missing attributes are stored as NULL or empty values
• Only works for STRING and NUMBER attributes

The COPY Command
The Good:
• Easy to use
• Fast
• Efficient use of resources
• Scales linearly with cluster size
• Only uses a certain percentage of read throughput
The Bad:
• Whole tables only
• No processing in between
• Can only copy from DynamoDB in the same region
• Only works with STRING and NUMBER types

Query & Insert at Scale
#1 Query / BatchGetItem in parallel, from Amazon EMR
#2 Retrieve items from Amazon DynamoDB
#3 INSERT … INTO (…) in parallel into Amazon Redshift

Query & Import using Amazon EMR (via Amazon S3)
#1 Query / BatchGetItem in parallel, from Amazon EMR
#2 Retrieve items from Amazon DynamoDB
#3 Export to file(s) on S3
#4 COPY … FROM s3:// on Amazon Redshift
#5 Redshift retrieves the files from Amazon S3

Query & Import using Amazon EMR (via HDFS)
#1 Query / BatchGetItem in parallel, from Amazon EMR
#2 Retrieve items from Amazon DynamoDB
#3 COPY … FROM emr:// on Amazon Redshift
#4 Redshift retrieves the files directly from HDFS

Query & Import using Amazon EMR
The Good:
• Decide which items you want to move to Redshift
• Full control over queries
• Process data on the way
• Scales well
• Integrates with other data sources easily
The Bad:
• Additional complexity
• Additional cost (for EMR)
• Slower than direct COPY from Amazon DynamoDB
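As an illustration of step #4 in the S3 variant above, the COPY that loads the exported files can be issued from any PostgreSQL client; here is a minimal Python sketch using psycopg2. The cluster endpoint, target table, S3 prefix, and the tab-delimited/GZIP file format are assumptions, not details from the talk.

    import psycopg2

    conn = psycopg2.connect(
        host="my-cluster.abc123.eu-west-1.redshift.amazonaws.com",  # hypothetical endpoint
        port=5439, dbname="analytics", user="admin", password="...")

    copy_sql = """
        COPY player_events (event_date, video_id, publisher_url, country)
        FROM 's3://my-bucket/exports/2014-05-15/'
        CREDENTIALS 'aws_access_key_id=…;aws_secret_access_key=…'
        DELIMITER '\\t' GZIP
    """

    with conn, conn.cursor() as cur:
        cur.execute(copy_sql)  # Redshift pulls the files in parallel across its slices
    conn.close()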
Please welcome
Erez Hadas-Sonnenschein, Sr. Product Manager
Witali Stohler, Data Warehouse & BI Specialist
clipkit GmbH

Video Syndication – The Possibilities

Content – Partner Overview
News, Sports, Cars/motor, Business/finances, Music, Gaming, Cinema, Cooking/food, Lifestyle/fashion, Traveling, Computer/mobile, Fitness/wellness, Knowledge/hobby, Entertainment

clipkit Player – Analytics (Metrics)
Full screen, Category, Playlist position, Play / Pause, Progress position, Mute / Unmute, Volume

clipkit Player – Analytics (Metrics)
Location (country, city), Language, Browser, Operating system, Video ID, Publisher URL, etc.

First Implementation (Expensive and Slow)
• Designed in the company's early days
• Not sized for this amount of data
• Slow copy process from S3 to the database (old PHP application architecture)
• Fixed EC2 pricing (expensive to cover peak hours)
• PostgreSQL scalability limitations
• Sometimes the copy process was so slow that the delay was ~3 days

Analytics / Metrics (Requests Graph)

Analytics / Metrics (Numbers)
• ~6,000,000 new entries per day
• ~1,000 requests per second (peak hours)
• ~25 requests per second (off-peak hours)
• 4,000% request growth during the day

Second Implementation (Expensive and Slow)
• Inserting into only one (big) table
• The COPY command only works on whole tables
• The minimum delay was one day
• Our solution had to increase the provisioned throughput, and that was expensive
• No real-time data

Third Implementation (Cheap and Fast)

Third Implementation – DynamoDB
• Java SDK AmazonDynamoDBAsyncClient ("fire and go")
• Easy to create and delete tables
• Write latency ~5 ms
• Throughput auto-scales with Dynamic DynamoDB
• One table per day
• Continuous iteration and copy to Redshift (a sketch of this daily cycle follows after the closing slide)
• We pay only for what we use

Third Implementation – Redshift
• Standard PostgreSQL JDBC driver
• Fully managed by Amazon
• Automated backups and fast restores
• ~7,000 item inserts per second
• Queries over more than 1 billion entries in less than 2 seconds
• Data available in near real time (maximum 1 minute delay)

Third Implementation – Conclusions
• Java web application
  – Auto scaling (off-peak: 1 small instance)
• DynamoDB
  – One table per day (deleted after it has been copied)
  – Auto scaling
  – ~5 ms PutItem latency
• Redshift
  – ~7,000 item inserts per second
  – Fully managed

Thank You!
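As referenced above, a minimal Python sketch of the daily-table cycle from the third implementation: events are written to a per-day DynamoDB table, read out and inserted into Redshift in batches, and the DynamoDB table is deleted once it has been copied. The talk itself used the Java SDK (AmazonDynamoDBAsyncClient); all table names, the key schema, and the Redshift endpoint below are illustrative assumptions.

    import datetime
    import boto3
    import psycopg2
    from psycopg2.extras import execute_values

    dynamodb = boto3.resource("dynamodb", region_name="eu-west-1")
    today = datetime.date.today().isoformat()
    events = dynamodb.Table("player_events_" + today)   # one table per day (hypothetical name)

    # Write path: cheap, low-latency puts from the player backend.
    events.put_item(Item={"event_id": "e-1", "video_id": "v-7", "event": "play"})

    # Load path: read the day's table and insert in batches into Redshift,
    # then delete the DynamoDB table once it has been copied (pagination omitted).
    conn = psycopg2.connect(
        host="my-cluster.abc123.eu-west-1.redshift.amazonaws.com",  # hypothetical endpoint
        port=5439, dbname="analytics", user="admin", password="...")
    rows = [(i["event_id"], i["video_id"], i["event"]) for i in events.scan()["Items"]]
    with conn, conn.cursor() as cur:
        # Multi-row INSERTs keep the insert rate high on the Redshift side.
        execute_values(cur,
                       "INSERT INTO player_events (event_id, video_id, event) VALUES %s",
                       rows)
    conn.close()
    events.delete()   # the copied daily table is no longer needed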
