ANT334-R: Migrate your data warehouse to the cloud in record time, featuring Fannie Mae
Tony Gibbs, Sr. Database Solutions Architect, Amazon Web Services
Amy Tseng, Engineering Manager, Fannie Mae
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Agenda
• Architecture and concepts
• Accelerating your data warehouse migration
• Fannie Mae enterprise data warehouse and data mart migration
• Additional resources
• Open Q&A

Are you an Amazon Redshift user?

Amazon Redshift: Data warehousing
ANSI SQL, ACID data warehouse: fast, powerful, and simple data warehousing at 1/10 the cost; massively parallel, petabyte scale

• Fast: columnar storage technology to improve I/O efficiency and parallelize queries; data load scales linearly
• Inexpensive: as low as $1,000 per terabyte per year, 1/10 the cost of traditional data warehouse solutions
• Scalable: resize your cluster up and down as your performance and capacity needs change
• Secure: data encrypted at rest and in transit; isolate clusters with a VPC; manage your own keys with AWS KMS
Amazon Redshift runs in the AWS Cloud, combining OLAP, MPP, PostgreSQL, and columnar technologies, and builds on services such as Amazon VPC, Amazon EC2, AWS Identity and Access Management (IAM), Amazon Simple Workflow Service, Amazon Simple Storage Service (S3), AWS Key Management Service, Amazon Route 53, and Amazon CloudWatch.

From launch in February 2013 through December 2019:
• > 175 significant patches
• > 300 significant features

Amazon Redshift architecture
Massively parallel, shared-nothing columnar architecture, accessed by SQL clients/BI tools over JDBC/ODBC

Leader node
• SQL endpoint
• Stores metadata
• Coordinates parallel SQL processing

Compute nodes
• Local, columnar storage
• Execute queries in parallel
• Load, unload, backup, restore (to and from Amazon S3)

Amazon Redshift Spectrum nodes
• Execute queries directly against Amazon Simple Storage Service (Amazon S3)

Terminology and concepts: Columnar
Amazon Redshift uses a columnar architecture for storing data on disk
Goal: reduce I/O for analytics queries
Physically store data on disk by column rather than row
Only read the column data that is required

Columnar architecture: Example
CREATE TABLE deep_dive (
  aid INT      --audience_id
  ,loc CHAR(3) --location
  ,dt DATE     --date
);

Sample data:
aid | loc | dt
  1 | SFO | 2017-10-20
  2 | JFK | 2017-10-20
  3 | SFO | 2017-04-01
  4 | JFK | 2017-05-14

SELECT min(dt) FROM deep_dive;

Row-based storage
• Need to read everything
• Unnecessary I/O

Columnar architecture: Example
With the same table, data, and query:

SELECT min(dt) FROM deep_dive;

Column-based storage
• Only scan blocks for the relevant column

Terminology and concepts: Compression
Goals
• Allow more data to be stored within an Amazon Redshift cluster
• Improve query performance by decreasing I/O

Impact
• Allows two to four times more data to be stored within the cluster
By default, COPY automatically analyzes and compresses data on first load into an empty table
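As a hedged sketch of how these two mechanisms are typically exercised (the S3 prefix and IAM role below are hypothetical placeholders, not from this deck):

```sql
-- First COPY into an empty table: Redshift samples the data and
-- applies column encodings automatically (COMPUPDATE defaults to ON)
COPY deep_dive
FROM 's3://my-bucket/deep_dive/'                          -- hypothetical prefix
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'  -- hypothetical role
DELIMITER '|';

-- For a table that already holds data, report the recommended
-- encoding per column (samples the table; does not change it)
ANALYZE COMPRESSION deep_dive;
```

Note that ANALYZE COMPRESSION only reports recommendations; applying them requires rebuilding the table (or, on newer releases, ALTER TABLE ... ALTER COLUMN ... ENCODE).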
ANALYZE COMPRESSION is a built-in command that finds the optimal compression for each column of an existing table

Compression: Example
Add 1 of 13 different encodings to each column:

CREATE TABLE deep_dive (
  aid INT      ENCODE AZ64      --audience_id
  ,loc CHAR(3) ENCODE BYTEDICT  --location
  ,dt DATE     ENCODE RUNLENGTH --date
);

More efficient compression is possible because each column stores a single data type in the columnar architecture
• Columns grow and shrink independently
• Reduces storage requirements
• Reduces I/O

Terminology and concepts: Blocks
Column data is persisted to 1 MB immutable blocks
Blocks are individually encoded with 1 of 13 encodings
A full block can contain millions of values

Terminology and concepts: Zone maps
Goal: eliminate unnecessary I/O

Zone maps are in-memory block metadata
• Contain per-block min and max values
• All blocks automatically have zone maps
• Effectively prune blocks that cannot contain data for a given query

Terminology and concepts: Data sorting
Goal
• Make queries run faster by increasing the effectiveness of zone maps and reducing I/O

Impact
• Enables range-restricted scans to prune blocks by leveraging zone maps
Achieved with the table property SORTKEY defined on one or more columns
Optimal sort key is dependent on:
• Query patterns
• Business requirements
• Data profile

Sort key: Example
Add a sort key on one or more columns to physically sort the data on disk:

CREATE TABLE deep_dive (
  aid INT      --audience_id
  ,loc CHAR(3) --location
  ,dt DATE     --date
) SORTKEY (dt, loc);
deep_dive (unsorted):
aid | loc | dt
  1 | SFO | 2017-10-20
  2 | JFK | 2017-10-20
  3 | SFO | 2017-04-01
  4 | JFK | 2017-05-14

deep_dive (sorted by dt, loc):
aid | loc | dt
  3 | SFO | 2017-04-01
  4 | JFK | 2017-05-14
  2 | JFK | 2017-10-20
  1 | SFO | 2017-10-20

Zone maps and sorting: Example
SELECT count(*) FROM deep_dive WHERE dt = '06-09-2017';

Unsorted table (per-block zone maps):
• MIN: 01-JUNE-2017, MAX: 20-JUNE-2017
• MIN: 08-JUNE-2017, MAX: 30-JUNE-2017
• MIN: 12-JUNE-2017, MAX: 20-JUNE-2017
• MIN: 02-JUNE-2017, MAX: 25-JUNE-2017

Sorted by date (per-block zone maps):
• MIN: 01-JUNE-2017, MAX: 06-JUNE-2017
• MIN: 07-JUNE-2017, MAX: 12-JUNE-2017
• MIN: 13-JUNE-2017, MAX: 21-JUNE-2017
• MIN: 21-JUNE-2017, MAX: 30-JUNE-2017

In the unsorted table, 09-JUNE-2017 falls inside the min/max range of every block, so all blocks must be scanned; in the sorted table, only one block can contain it.

Terminology and concepts: Slices
A slice can be thought of as a virtual compute node.

Facts about slices
• Unit of data partitioning
• Each compute node has either 2 or 16 slices
• Parallel query processing
• Table rows are distributed to slices
• A slice processes only its own data

Data distribution
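Distribution operates at the slice level, and the slice layout of a running cluster can be inspected directly; a small sketch against the stv_slices system view (view and column names as documented for Amazon Redshift):

```sql
-- Count the slices exposed by each compute node
SELECT node, COUNT(*) AS slices
FROM stv_slices
GROUP BY node
ORDER BY node;
```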
Distribution style is a table property that dictates how the table's data is distributed throughout the cluster
• KEY: value is hashed; the same value goes to the same location (slice)
• ALL: full table data goes to the first slice of every node
• EVEN: round robin
• AUTO: combines EVEN and ALL

Goals
• Distribute data evenly for parallel processing
• Minimize data movement during query processing

Data distribution: Example
CREATE TABLE deep_dive (
  aid INT      --audience_id
  ,loc CHAR(3) --location
  ,dt DATE     --date
) DISTSTYLE (EVEN|KEY|ALL|AUTO);

Table deep_dive has user columns (aid, loc, dt) plus hidden system columns (ins, del, row); its structure exists on every slice (Slice 0 through Slice 3) across Node 1 and Node 2.

Data distribution: EVEN Example
CREATE TABLE deep_dive (
  aid INT      --audience_id
  ,loc CHAR(3) --location
  ,dt DATE     --date
) DISTSTYLE EVEN;

INSERT INTO deep_dive VALUES
  (1, 'SFO', '2016-09-01'),
  (2, 'JFK', '2016-09-14'),
  (3, 'SFO', '2017-04-01'),
  (4, 'JFK', '2017-05-14');

Round robin: each slice (Slice 0 through Slice 3 across Node 1 and Node 2) receives 1 row.

Data distribution: KEY Example #1
CREATE TABLE deep_dive (
  aid INT      --audience_id
  ,loc CHAR(3) --location
  ,dt DATE     --date
) DISTSTYLE KEY DISTKEY (loc);

INSERT INTO deep_dive VALUES
  (1, 'SFO', '2016-09-01'),
  (2, 'JFK', '2016-09-14'),
  (3, 'SFO', '2017-04-01'),
  (4, 'JFK', '2017-05-14');

With only two distinct loc values, all rows hash to Node 1: Slice 0 and Slice 1 hold 2 rows each, while Slice 2 and Slice 3 hold 0 rows (row skew).

Data distribution: KEY Example #2
CREATE TABLE deep_dive (
  aid INT      --audience_id
  ,loc CHAR(3) --location
  ,dt DATE     --date
) DISTSTYLE KEY DISTKEY (aid);

INSERT INTO deep_dive VALUES
  (1, 'SFO', '2016-09-01'),
  (2, 'JFK', '2016-09-14'),
  (3, 'SFO', '2017-04-01'),
  (4, 'JFK', '2017-05-14');

The high-cardinality aid column hashes evenly: each slice (Slice 0 through Slice 3) receives 1 row.

Data distribution: ALL Example
CREATE TABLE deep_dive (
  aid INT      --audience_id
  ,loc CHAR(3) --location
  ,dt DATE     --date
) DISTSTYLE ALL;

INSERT INTO deep_dive VALUES
  (1, 'SFO', '2016-09-01'),
  (2, 'JFK', '2016-09-14'),
  (3, 'SFO', '2017-04-01'),
  (4, 'JFK', '2017-05-14');

The full table is replicated to the first slice of every node: Slice 0 on Node 1 and Slice 2 on Node 2 each hold all 4 rows; the remaining slices hold 0 rows.

Summary: Data distribution
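Distribution style is also not a one-way door: if query patterns change, newer Amazon Redshift releases can redistribute an existing table in place. A hedged sketch, reusing the deep_dive table from the examples above:

```sql
-- Redistribute an existing table on a new key; Redshift rewrites
-- the table's rows across slices without a drop-and-reload
ALTER TABLE deep_dive ALTER DISTSTYLE KEY DISTKEY aid;

-- Or let Redshift manage the choice (combines ALL and EVEN)
ALTER TABLE deep_dive ALTER DISTSTYLE AUTO;
```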
DISTSTYLE KEY
Goals
• Optimize JOIN performance between large tables by distributing on columns used in the ON clause
• Optimize INSERT INTO SELECT performance
• Optimize GROUP BY performance
The column being distributed on should have high cardinality and not cause row skew:

SELECT diststyle, skew_rows
FROM svv_table_info
WHERE "table" = 'deep_dive';

 diststyle | skew_rows
-----------+-----------
 KEY(aid)  | 1.07

skew_rows is the ratio between the slice with the most rows and the slice with the fewest rows.

DISTSTYLE ALL
Goals
• Optimize JOIN performance with dimension tables
• Reduces disk usage on small tables
• Use for small and medium-size dimension tables (< 3M rows)

DISTSTYLE EVEN
• Use if neither KEY nor ALL applies

DISTSTYLE AUTO
• Default distribution; combines DISTSTYLE ALL and EVEN

AWS migration tooling
AWS Schema Conversion Tool (SCT) converts your commercial database and data warehouse schemas to open-source engines or AWS-native services, such as Amazon Aurora and Amazon Redshift
AWS Database Migration Service (DMS) easily and securely migrates and/or replicates your databases and data warehouses to AWS

AWS Schema Conversion Tool
The AWS Schema Conversion Tool helps automate database schema and code conversion tasks when migrating from source to target database engines

Features
• Create assessment reports for homogeneous/heterogeneous migrations
• Convert database schema
• Convert data warehouse schema
• Convert embedded application code
• Code browser that highlights places where manual edits are required
• Secure connections to your databases with SSL
• Service substitutions/ETL modernization to AWS Glue
• Migrate data to data warehouses using SCT data extractors
• Optimize schemas in Amazon Redshift

AWS SCT data extractors
Extract data from your data warehouse and migrate it to Amazon Redshift
• Extracts data through local migration agents
• Data is optimized for Amazon Redshift and saved in local files
• Files are loaded to an Amazon S3 bucket (over the network or via AWS Snowball Edge) and then into Amazon Redshift

Flow: Source DW → AWS SCT → Amazon S3 bucket → Amazon Redshift

AWS Database Migration Service
Migrating
• Migrate between on-premises and AWS databases
• Migrate between databases on AWS
• Automated schema conversion
• Data replication for zero-downtime migration

Legacy data warehouse migration tips
When moving from legacy row-based data warehouses
• Denormalize tables where it makes sense (predicate columns in the fact table)
• Avoid date dimension tables
• Amazon Redshift is efficient with wide tables because of columnar storage and compression

When moving from SMP legacy data warehouses
• Colocation of tables is required for fast JOINs; leverage DISTSTYLE ALL/AUTO or KEY
• Amazon Redshift is designed for big data (> 100 GB to PB scale); for smaller datasets, consider Amazon Aurora PostgreSQL
• Wrap workflows in explicit transactions

Leverage Amazon Redshift stored procedures for faster migrations
• PL/pgSQL stored procedures were added to make porting legacy procedures easier

Fannie Mae
• Home ownership: 64% in the United States; 1 in 4 homes financed by Fannie Mae
• Liquidity: our customers can exchange loans and securities for cash immediately
• Access: loans available to support buying and renting for most Americans
• Affordability: support the housing finance system to enable reasonable payments for all

Roles of our data warehouse
• System of record
• Serve analytics, research, and data science
• Regulatory and financial reports
• Consolidate legacy data warehouse

Data warehouse use cases
• External data: loan data from external business partners and customers
• Integrated data store and transaction data store
• Business systems feed extract, transform, and load (ETL) into the enterprise data warehouse and data marts, which serve business intelligence and analytics

On-premises EDW context diagram
The context diagram shows transactional data stores and an operational data store (the trusted source, the system of record) feeding the enterprise data warehouse in near real time and batch; master data domains (party, product, securities, loan, property, agreement) support sourcing, delivery and verification, issuance and securitization, clearance and settlement, servicing, and credit loss management; downstream data marts and a business intelligence layer serve finance, risk, disclosures, surveillance, pricing, forecasting, and reporting.

New challenges
• DW consolidation → 6x–20x transactions in the last 18 months
• Digital transformation → 3 million queries/month, 4x data growth
• Successful user adoption → user base grew to 4,000
• Concurrency, scalability, and time to market

We need
• Self service
• Dynamic scaling
• Data sharing
• Cross-domain access (loan, credit)
• High concurrency

EDW hybrid architecture
The hybrid architecture spans the corporate data center and the AWS Cloud. On premises, upstream applications, the integrated data store (IDS), and the EDW push data through an on-prem host over AWS Direct Connect and the S3 API into the enterprise data lake on Amazon S3, with AWS KMS, AWS STS, IAM, and AWS Lambda in the ingestion and security path. In the cloud, the EDW runs on Amazon Redshift alongside Amazon EMR, the AWS Glue Data Catalog, and Amazon Athena; data marts, BI reports, downstream applications, and end users consume from both environments.

Data mart migration approach
• One-time migration or continuous data replication: source DB → AWS DMS / AWS SCT → Amazon Redshift
• 10 TB+ databases: AWS Snowball Edge → Amazon S3 → Amazon Redshift
• Batch ingestion: Amazon S3 → Amazon Redshift

Data sharing with Amazon Redshift Spectrum
Business users query through Amazon Redshift Spectrum. The credit department unloads data to user space in Amazon S3 (credit unload); external tables registered in the AWS Glue Data Catalog let the credit and loan departments read each other's data (credit read, loan read) from their respective Amazon Redshift clusters and S3 buckets.

Self service capability with Amazon Redshift Spectrum
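One way such cross-domain reads are typically wired up (a hedged sketch; the Glue database, table, S3 path, and IAM role below are hypothetical placeholders, not Fannie Mae's actual names):

```sql
-- Register the credit department's Glue Data Catalog database as an
-- external schema; Spectrum then queries the underlying S3 data directly
CREATE EXTERNAL SCHEMA credit_ext
FROM DATA CATALOG DATABASE 'credit_db'                  -- hypothetical Glue DB
IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole'; -- hypothetical role

-- The loan department's cluster can now read credit data without a copy
SELECT COUNT(*) FROM credit_ext.credit_facts;           -- hypothetical table

-- The producing side lands data in user space with UNLOAD
UNLOAD ('SELECT * FROM credit_schema.credit_facts')
TO 's3://credit-share/facts/'                           -- hypothetical bucket
IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole';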
In the AWS Cloud, a research cluster (Amazon Redshift DC2 8xlarge) has read/write access to Amazon S3, while the production cluster (Amazon Redshift DC2 8xlarge) has read-only access to the same Amazon S3 data.

Key Redshift features: Solutions to our challenges
Key Amazon Redshift features used: Amazon Redshift Spectrum, Concurrency Scaling, Elastic Resize, Auto WLM, Amazon Redshift Advisor

Solutions to our challenges:
• Saved hours of time spent on data loading
• Solved the concurrency issue, transparent to users
• Achieved 25% cost saving; reduced cost
• Enabled data sharing across teams and business units
• Allows DBAs to spend more time on design and architecture

Concurrency Scaling performance
With the Concurrency Scaling feature, similar or better performance was achieved with 50% of the compute resources (chart: execution time in seconds for 11 queries, RS 16 nodes vs. RS 8 nodes with burst, concurrency = 30).

Concurrency Scaling feature, before vs. after: with Concurrency Scaling, query performance was consistent (did not degrade) as concurrency increased from 1 to 200 (chart: average elapsed time in ms; burst vs. no burst on 16 dc2.8xlarge nodes, vs. on-prem).

Cost comparison
Cost vs. performance with elastic resize (chart): monthly cost ($95, $120.62, and $147) and actual elapsed time (335.86, 675.82, and 693.82 minutes) compared across on-prem (18 nodes), Amazon Redshift on-demand (16 nodes), and Amazon Redshift on-demand (12 nodes with elastic resize).

Production implementation
• Guardrail: automatically time out long-running user queries using Amazon Redshift QMR (query monitoring rules)
• Usage trending: build a repository for historical usage trending
• Logging and monitoring: enable Amazon Redshift audit logging to monitor sensitive data access, for auditing and compliance

Lessons learned
• Choose node type wisely: the DS2 node type saves us cost in lower environments
• The Concurrency Scaling feature applies to read-only queries
• 60-second wait time to initiate Concurrency Scaling feature
• More nodes do not always yield better performance—find your sweet spot!
• Integration with the AWS ecosystem is equally important to performance and cost
• User training is important
• Engage the AWS team early

AWSLabs on GitHub: Amazon Redshift
https://github.com/awslabs/amazon-redshift-utils
https://github.com/awslabs/amazon-redshift-monitoring
https://github.com/awslabs/amazon-redshift-udfs
• Admin scripts: collection of utilities for running diagnostics on your cluster
• Admin views: collection of utilities for managing your cluster, generating schema DDL, and so on
• Analyze Vacuum utility: can be scheduled to vacuum and analyze the tables within your Amazon Redshift cluster
• Column Encoding utility: applies optimal column encoding to an established schema with data already loaded

AWS Big Data Blog: Amazon Redshift
Amazon Redshift Engineering’s Advanced Table Design Playbook https://aws.amazon.com/blogs/big-data/amazon-redshift-engineerings-advanced-table- design-playbook-preamble-prerequisites-and-prioritization/ –Zach Christopherson
Top 10 Performance Tuning Techniques for Amazon Redshift https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-techniques-for-amazon- redshift/ –Ian Meyers and Zach Christopherson
10 Best Practices for Amazon Redshift Spectrum https://aws.amazon.com/blogs/big-data/10-best-practices-for-amazon-redshift-spectrum/ –Po Hong and Peter Dalton

Learn databases with AWS Training and Certification
Resources created by the experts at AWS to help you build and validate database skills 25+ free digital training courses cover topics and services related to databases, including • Amazon Aurora • Amazon ElastiCache • Amazon Neptune • Amazon Redshift • Amazon DocumentDB • Amazon RDS • Amazon DynamoDB
Validate expertise with the new AWS Certified Database - Specialty beta exam

Visit aws.training

Thank you!
Tony Gibbs, Amazon Web Services Amy Tseng, Fannie Mae