<<

ANT334-R: Migrate your data warehouse to the cloud in record time, featuring Fannie Mae

Tony Gibbs, Sr. Solutions Architect, Amazon Web Services
Amy Tseng, Engineering Manager, Fannie Mae

Agenda
• Architecture and concepts
• Accelerating your data warehouse migration
• Fannie Mae enterprise data warehouse and data mart migration
• Additional resources
• Open Q&A

Are you an Amazon Redshift user?

Amazon Redshift: Data warehousing

• ANSI SQL, ACID-compliant data warehouse
• Fast, powerful, and simple data warehousing at 1/10 the cost
• Massively parallel, petabyte scale

Fast: Columnar storage technology to improve I/O efficiency and parallelize queries; data load scales linearly
Inexpensive: As low as $1,000 per terabyte per year, 1/10 the cost of traditional data warehouse solutions
Scalable: Resize your cluster up and down as your performance and capacity needs change
Secure: Data encrypted at rest and in transit; isolate clusters with VPC; manage your own keys with AWS KMS

AWS Cloud: Amazon Redshift (OLAP, MPP, columnar, PostgreSQL-based) integrates with Amazon VPC, Amazon EC2, AWS Identity and Access Management (IAM), Amazon Simple Workflow Service, Amazon Simple Storage Service (S3), AWS Key Management Service, Amazon Route 53, and Amazon CloudWatch.

From launch in February 2013 through December 2019: > 175 significant patches and > 300 significant features.

Amazon Redshift architecture

Massively parallel, shared-nothing columnar architecture
• SQL clients/BI tools connect to the leader node over JDBC/ODBC
• Leader node: SQL endpoint; stores metadata; coordinates parallel SQL processing
• Compute nodes (1 ... N): local, columnar storage; execute queries in parallel; handle load, unload, backup, and restore
• Amazon Redshift Spectrum nodes: execute queries directly against Amazon Simple Storage Service (Amazon S3)

Terminology and concepts: Columnar

Amazon Redshift uses a columnar architecture for storing data on disk

Goal: reduce I/O for analytics queries

Physically store data on disk by column rather than by row

Only read the column data that is required

Columnar architecture: Example

CREATE TABLE deep_dive (
  aid INT      --audience_id
 ,loc CHAR(3)  --location
 ,dt DATE      --date
);

SELECT min(dt) FROM deep_dive;

Sample data:
 aid | loc | dt
   1 | SFO | 2017-10-20
   2 | JFK | 2017-10-20
   3 | SFO | 2017-04-01
   4 | JFK | 2017-05-14

Row-based storage
• Need to read everything
• Unnecessary I/O


Column-based storage
• Only scan blocks for the relevant column

Terminology and concepts: Compression

Goals: Allow more data to be stored within an Amazon Redshift cluster; improve query performance by decreasing I/O
Impact: Allows two to four times more data to be stored within the cluster

By default, COPY automatically analyzes and compresses data on first load into an empty table
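For illustration, a minimal sketch of such a first load; the S3 prefix and IAM role are hypothetical, and with the defaults COPY samples the incoming data and picks a compression encoding for each column of the empty table.

-- Hypothetical bucket/prefix and IAM role
COPY deep_dive
FROM 's3://example-bucket/deep_dive/'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftRole'
CSV
COMPUPDATE ON;  -- explicit here; automatic compression analysis is the default for an empty target table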

ANALYZE COMPRESSION is a built-in command that will find the optimal compression for each column on an existing table
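A minimal sketch of reviewing encodings on an already-loaded table; the output lists a suggested encoding and estimated size reduction per column (the COMPROWS value here is just an illustrative sample-size cap).

-- Report the recommended encoding per column of an existing table
ANALYZE COMPRESSION deep_dive COMPROWS 1000000;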

Compression: Example

CREATE TABLE deep_dive (
  aid INT      --audience_id
 ,loc CHAR(3)  --location
 ,dt DATE      --date
);

Add 1 of 13 different encodings to each column

CREATE TABLE deep_dive (
  aid INT      ENCODE AZ64
 ,loc CHAR(3)  ENCODE BYTEDICT
 ,dt DATE      ENCODE RUNLENGTH
);

More efficient compression is due to storing the same data type in the columnar architecture
• Columns grow and shrink independently
• Reduces storage requirements
• Reduces I/O

Terminology and concepts: Blocks

Column data is persisted to 1 MB immutable blocks

Blocks are individually encoded with 1 of 13 encodings

A full block can contain millions of values

Terminology and concepts: Zone maps

Goal: Eliminate unnecessary I/O

In-memory block metadata
• Contains per-block min and max values
• All blocks automatically have zone maps
• Effectively prunes blocks that cannot contain data for a given query
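One way to see zone maps directly is to query the block metadata; a sketch below, assuming the deep_dive table from the earlier examples. STV_BLOCKLIST exposes per-block min/max values (as internal integer representations) and is joined to STV_TBL_PERM to filter by table name.

-- Per-block zone map metadata for deep_dive
SELECT b.slice, b.col, b.blocknum, b.minvalue, b.maxvalue
FROM stv_blocklist b
JOIN stv_tbl_perm p
  ON b.tbl = p.id AND b.slice = p.slice
WHERE p.name = 'deep_dive'
ORDER BY b.col, b.slice, b.blocknum;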

Terminology and concepts: Data sorting

Goal: Make queries run faster by increasing the effectiveness of zone maps and reducing I/O
Impact: Enables range-restricted scans to prune blocks by leveraging zone maps

Achieved with the table property SORTKEY defined on one or more columns

Optimal sort key is dependent on:
• Query patterns
• Business requirements
• Data profile

Sort key: Example

Add a sort key to one or more columns to physically sort the data on disk

CREATE TABLE deep_dive (
  aid INT      --audience_id
 ,loc CHAR(3)  --location
 ,dt DATE      --date
) SORTKEY (dt, loc);

deep_dive (unsorted)           deep_dive (sorted by dt, loc)
 aid | loc | dt                 aid | loc | dt
   1 | SFO | 2017-10-20           3 | SFO | 2017-04-01
   2 | JFK | 2017-10-20           4 | JFK | 2017-05-14
   3 | SFO | 2017-04-01           2 | JFK | 2017-10-20
   4 | JFK | 2017-05-14           1 | SFO | 2017-10-20
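As a rough check after choosing a sort key, SVV_TABLE_INFO reports the declared sort key and how much of the table is unsorted, and a VACUUM re-sorts rows written after the initial load; a minimal sketch, assuming the deep_dive table above.

-- Check the declared sort key and the percentage of unsorted rows
SELECT "table", sortkey1, unsorted
FROM svv_table_info
WHERE "table" = 'deep_dive';

-- Re-sort rows that arrived after the table was last sorted
VACUUM SORT ONLY deep_dive;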

Zone maps and sorting: Example

SELECT count(*) FROM deep_dive WHERE dt = '06-09-2017';

Per-block zone maps on dt:

          Unsorted table                          Sorted by date
 Block 1  MIN: 01-JUNE-2017  MAX: 20-JUNE-2017    MIN: 01-JUNE-2017  MAX: 06-JUNE-2017
 Block 2  MIN: 08-JUNE-2017  MAX: 30-JUNE-2017    MIN: 07-JUNE-2017  MAX: 12-JUNE-2017
 Block 3  MIN: 12-JUNE-2017  MAX: 20-JUNE-2017    MIN: 13-JUNE-2017  MAX: 21-JUNE-2017
 Block 4  MIN: 02-JUNE-2017  MAX: 25-JUNE-2017    MIN: 21-JUNE-2017  MAX: 30-JUNE-2017

In the unsorted table the predicate overlaps most blocks, while in the sorted table only a single block needs to be read.

Terminology and concepts: Slices

A slice can be thought of as a virtual compute node

Facts about slices
• Unit of data partitioning
• Each compute node has either 2 or 16 slices
• Parallel query processing
• Table rows are distributed to slices
• A slice processes only its own data
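To see how many slices a cluster actually has and which node owns each one, the STV_SLICES system table lists the slice-to-node mapping; a minimal sketch:

-- One row per slice, showing which node owns it
SELECT node, slice
FROM stv_slices
ORDER BY node, slice;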

Data distribution

Distribution style is a table property which dictates how that table's data is distributed throughout the cluster:
• KEY: value is hashed; the same value goes to the same location (slice)
• ALL: full table data goes to the first slice of every node
• EVEN: round robin
• AUTO: combines EVEN and ALL

Goals
• Distribute data evenly for parallel processing
• Minimize data movement during query processing

Data distribution: Example

CREATE TABLE deep_dive (
  aid INT      --audience_id
 ,loc CHAR(3)  --location
 ,dt DATE      --date
) DISTSTYLE (EVEN|KEY|ALL|AUTO);

Table deep_dive has user columns (aid, loc, dt) and system columns (ins, del, row), stored across Slice 0 and Slice 1 on Node 1 and Slice 2 and Slice 3 on Node 2.

Data distribution: EVEN Example

CREATE TABLE deep_dive (
  aid INT      --audience_id
 ,loc CHAR(3)  --location
 ,dt DATE      --date
) DISTSTYLE EVEN;

INSERT INTO deep_dive VALUES
  (1, 'SFO', '2016-09-01'),
  (2, 'JFK', '2016-09-14'),
  (3, 'SFO', '2017-04-01'),
  (4, 'JFK', '2017-05-14');

Result: rows are distributed round robin; each of the four slices holds one row.
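To confirm how the rows actually landed here (and in the KEY and ALL examples that follow), per-slice row counts can be read from the STV_TBL_PERM system table; a minimal sketch, assuming the deep_dive table:

-- Row counts per slice; with DISTSTYLE EVEN each slice should report 1 row
SELECT slice, sum(rows) AS total_rows
FROM stv_tbl_perm
WHERE name = 'deep_dive'
GROUP BY slice
ORDER BY slice;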

Data distribution: KEY Example #1

CREATE TABLE deep_dive (
  aid INT      --audience_id
 ,loc CHAR(3)  --location
 ,dt DATE      --date
) DISTSTYLE KEY DISTKEY (loc);

INSERT INTO deep_dive VALUES
  (1, 'SFO', '2016-09-01'),
  (2, 'JFK', '2016-09-14'),
  (3, 'SFO', '2017-04-01'),
  (4, 'JFK', '2017-05-14');

Result: rows are distributed by hashing loc, so the two 'SFO' rows land on one slice and the two 'JFK' rows on another; the remaining slices hold 0 rows.

Data distribution: KEY Example #2

CREATE TABLE deep_dive (
  aid INT      --audience_id
 ,loc CHAR(3)  --location
 ,dt DATE      --date
) DISTSTYLE KEY DISTKEY (aid);

INSERT INTO deep_dive VALUES
  (1, 'SFO', '2016-09-01'),
  (2, 'JFK', '2016-09-14'),
  (3, 'SFO', '2017-04-01'),
  (4, 'JFK', '2017-05-14');

Result: aid is hashed and has a distinct value for every row, so each of the four slices holds one row.

Data distribution: ALL Example

CREATE TABLE deep_dive (
  aid INT      --audience_id
 ,loc CHAR(3)  --location
 ,dt DATE      --date
) DISTSTYLE ALL;

INSERT INTO deep_dive VALUES
  (1, 'SFO', '2016-09-01'),
  (2, 'JFK', '2016-09-14'),
  (3, 'SFO', '2017-04-01'),
  (4, 'JFK', '2017-05-14');

Result: a full copy of the table (all 4 rows) is stored on the first slice of each node (Slice 0 and Slice 2); Slices 1 and 3 hold no rows.

Summary: Data distribution

DISTSTYLE KEY
Goals:
• Optimize JOIN performance between large tables by distributing on columns used in the ON clause
• Optimize INSERT INTO SELECT performance
• Optimize GROUP BY performance
• The column being distributed on should have high cardinality and not cause row skew

Check skew (the ratio between the slice with the most and the slice with the fewest rows):

SELECT diststyle, skew_rows
FROM svv_table_info
WHERE "table" = 'deep_dive';

 diststyle | skew_rows
-----------+-----------
 KEY(aid)  | 1.07

DISTSTYLE ALL
Goals:
• Optimize JOIN performance with dimension tables
• Reduces disk usage on small tables
• Small and medium-size dimension tables (< 3M rows)

DISTSTYLE EVEN
• Use when neither KEY nor ALL applies

DISTSTYLE AUTO
• Default distribution; combines DISTSTYLE ALL and EVEN
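If a table turns out to be skewed, the distribution key can be changed in place; a minimal sketch below. ALTER TABLE ... ALTER DISTKEY is a documented command, though availability of the broader ALTER DISTSTYLE variants depends on your cluster version.

-- Find skewed tables across the cluster
SELECT "table", diststyle, skew_rows
FROM svv_table_info
ORDER BY skew_rows DESC;

-- Redistribute deep_dive on a higher-cardinality column
ALTER TABLE deep_dive ALTER DISTKEY aid;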

AWS migration tooling

AWS Schema Conversion Tool (SCT) converts your commercial database and data warehouse schemas to open-source engines or AWS-native services, such as Amazon Aurora and Amazon Redshift

AWS Database Migration Service (DMS) easily and securely migrates and/or replicates your databases and data warehouses to AWS

AWS Schema Conversion Tool

The AWS Schema Conversion Tool helps automate database schema and code conversion tasks when migrating from source to target database engines

Features
• Create assessment reports for homogeneous/heterogeneous migrations
• Convert database schema
• Convert data warehouse schema
• Convert embedded application code
• Code browser that highlights places where manual edits are required
• Secure connections to your databases with SSL
• Service substitutions/ETL modernization to AWS Glue
• Migrate data to data warehouses using SCT data extractors
• Optimize schemas in Amazon Redshift

Flow: Source DB → AWS SCT → Target DB

AWS SCT data extractors

Extract data from your data warehouse and migrate to Amazon Redshift
• Extracts data through local migration agents
• Data is optimized for Amazon Redshift and saved in local files
• Files are loaded to an Amazon S3 bucket (through the network or Snowball Edge) and then to Amazon Redshift

Flow: Source DW → AWS SCT → Amazon S3 bucket → Amazon Redshift

AWS Database Migration Service

• Migrate between on-premises and AWS databases
• Migrate databases to AWS
• Automated schema conversion

• Data replication for zero-downtime migration

Legacy data warehouse migration tips

When moving from legacy row-based data warehouses
• Denormalize tables where it makes sense (bring predicate columns into the fact tables)
• Avoid date dimension tables
• Amazon Redshift is efficient with wide tables because of columnar storage and compression

When moving from SMP legacy data warehouses
• Colocation of tables is required for fast JOINs
• Leverage DISTSTYLE ALL/AUTO or KEY
• Amazon Redshift is designed for big data (> 100 GB to PB scale)
• For smaller datasets, consider Amazon Aurora PostgreSQL
• Wrap workflows in explicit transactions

Leverage Amazon Redshift stored procedures for faster migrations
• PL/pgSQL stored procedures were added to make porting legacy procedures easier (a sketch follows below)
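As a flavor of what porting looks like, a minimal sketch: an explicit transaction around a small multi-step workflow, and the same workflow wrapped in a PL/pgSQL stored procedure. The stage_deep_dive table and the filter values are hypothetical.

-- Explicit transaction around a multi-step workflow (illustrative table names)
BEGIN;
  DELETE FROM deep_dive WHERE dt < '2016-01-01';
  INSERT INTO deep_dive SELECT aid, loc, dt FROM stage_deep_dive;
COMMIT;

-- The same workflow as a PL/pgSQL stored procedure; an unhandled error rolls back its work
CREATE OR REPLACE PROCEDURE refresh_deep_dive()
AS $$
BEGIN
  DELETE FROM deep_dive WHERE dt < '2016-01-01';
  INSERT INTO deep_dive SELECT aid, loc, dt FROM stage_deep_dive;
END;
$$ LANGUAGE plpgsql;

CALL refresh_deep_dive();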

Fannie Mae
• Home ownership: 64% in the United States; 1 in 4 homes financed by Fannie Mae
• Liquidity: our customers can exchange loans and securities for cash immediately
• Access: loans available to support buying and renting for most Americans
• Affordability: support the housing finance system to enable reasonable payments for all

Roles of our data warehouse

• System of record
• Serve analytics, research, and data science
• Regulatory and financial reports
• Consolidate legacy data warehouses

Data warehouse use cases

External data: loan data from external business partners and customers
Integrated data store: business systems and external data land in a transaction data store, are transformed and loaded (ETL) into the enterprise data warehouse and data marts, and serve business intelligence and analytics

On-premises EDW context diagram

[Diagram: the on-premises EDW context spans a transactional layer (transactional data store as the system of record and trusted source), operational and supporting master data (party, product, securities, loan, property, and agreement masters), a data integration/data access layer with near-real-time and batch feeds, business functions to manage, plan, and run the business (sourcing, delivery and verification, issuance and securitization, clearance and settlement, servicing, collateral and credit loss management, security disclosure), the integrated enterprise data warehouse, and downstream data marts (market risk, pricing, surveillance, collateral, finance, HR, disclosures, and related reporting), plus external data sources.]

New challenges

• DW consolidation → 6x–20x transactions in the last 18 months
• Digital transformation → 3 million queries/month, 4x data growth
• Successful user adoption → user base grew to 4,000
• Concurrency, scalability, and time to market

We need

• Self service
• Dynamic scaling
• Data sharing
• Cross-domain access (e.g., Loan and Credit domains)
• High concurrency

EDW hybrid architecture

[Diagram: the corporate data center (upstream applications, IDS, on-prem host, the existing EDW and data mart, end users, and BI reports) connects over AWS Direct Connect to the AWS Cloud, where the S3 API feeds an enterprise data lake on Amazon S3; AWS Lambda, the AWS Glue Data Catalog, Amazon EMR, Amazon Athena, and the EDW on Amazon Redshift sit on top of the lake, secured with AWS KMS, AWS STS, and IAM, serving downstream apps, end users, and BI reports.]

Data mart migration approach

• One-time migration or continuous data replication: source DB → AWS DMS / AWS SCT → Amazon Redshift
• 10 TB+ databases: source DB → AWS Snowball Edge → Amazon S3 → Amazon Redshift
• Batch ingestion: Amazon S3 → Amazon Redshift

Data sharing with Amazon Redshift Spectrum

[Diagram: the Credit department's Amazon Redshift cluster unloads curated data to a shared user space in Amazon S3 and registers it as external tables in the AWS Glue Data Catalog; the Loan department's Amazon Redshift cluster and business users read that data in place through Redshift Spectrum (credit unload, credit read, loan read).]
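A minimal sketch of that pattern under assumed names (the bucket, IAM role, Glue database, and column list are all hypothetical): the producing cluster unloads to S3, the data is registered as an external table, and the consuming cluster queries it through an external schema via Redshift Spectrum.

-- Producer (Credit) cluster: unload curated data to the shared S3 prefix
UNLOAD ('SELECT aid, loc, dt FROM deep_dive')
TO 's3://example-credit-share/deep_dive/'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftRole'
FORMAT AS PARQUET;

-- Consumer (Loan) cluster: map an external schema onto the Glue Data Catalog
CREATE EXTERNAL SCHEMA credit_share
FROM DATA CATALOG
DATABASE 'credit_share_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Register the unloaded files as an external table, then query in place
CREATE EXTERNAL TABLE credit_share.deep_dive (
  aid INT,
  loc VARCHAR(3),
  dt  DATE
)
STORED AS PARQUET
LOCATION 's3://example-credit-share/deep_dive/';

SELECT loc, count(*) FROM credit_share.deep_dive GROUP BY loc;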

Self service capability with Amazon Redshift Spectrum

AWS Cloud
• Research cluster (Amazon Redshift DC2 8xlarge): read/write access to Amazon S3
• Production cluster (Amazon Redshift DC2 8xlarge): read-only access to Amazon S3

Key Redshift features: solutions to our challenges

• Amazon Redshift Spectrum
• Concurrency Scaling
• Elastic Resize
• Auto WLM
• Amazon Redshift Advisor

• Saved hours of time spent on data loading
• Solved the concurrency issue
• Achieved 25% cost saving
• Allows DBAs to spend more time on the design and architecture
• Transparent to users
• Enabled data sharing across teams and business units
• Reduced cost

Concurrency Scaling performance
[Chart: execution time (sec.) for 11 queries, comparing 16 Redshift nodes against 8 Redshift nodes with burst (Concurrency Scaling) at concurrency = 30.] With the Concurrency Scaling feature, similar or better performance is achieved with 50% of the compute resource.

Concurrency Scaling feature: before vs. after
[Chart: average elapsed time (ms) at concurrency levels 1, 15, 50, 100, 150, and 200, comparing a 16-node dc2.8xlarge cluster with bursting (Burst), without bursting (No Burst), and on-prem.] With the Concurrency Scaling feature, query performance was consistent (did not degrade) as concurrency increased.

Cost comparison

[Charts: "Monthly cost" and "Cost vs. performance w/Elastic Resize" compare on-prem (18 nodes), Amazon Redshift on-demand (16 nodes), and Amazon Redshift on-demand (12 nodes with Elastic Resize); the charts report monthly costs of $95, $120.62, and $147 and actual elapsed times of 335.86, 675.82, and 693.82 minutes across these configurations.]

Production implementation

Guardrail: automatically time out long-running user queries using Amazon Redshift QMR (query monitoring rules)
Usage trending: build a repository for historical usage trending
Logging and monitoring: enable Amazon Redshift audit logging to monitor sensitive data access, for auditing and compliance
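QMR rules themselves are defined in the WLM configuration rather than in SQL, but the resulting actions can be audited from SQL; a sketch using the STL_WLM_RULE_ACTION log, which records queries that a rule logged, hopped, or aborted.

-- Recent QMR actions (e.g., long-running queries that were aborted)
SELECT userid, query, service_class, rule, action, recordtime
FROM stl_wlm_rule_action
ORDER BY recordtime DESC
LIMIT 20;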

Lessons learned

• Choose node type wisely: the DS2 node type saves cost for us in lower environments

• The Concurrency Scaling feature is applicable to read-only queries

• 60-second wait time to initiate Concurrency Scaling feature

• More nodes do not always yield better performance—find your sweet spot!

• Integration with the AWS ecosystem is as important as performance and cost

• User training is important

• Engage the AWS team early

AWSLabs on GitHub—Amazon Redshift
https://github.com/awslabs/amazon-redshift-utils
https://github.com/awslabs/amazon-redshift-monitoring
https://github.com/awslabs/amazon-redshift-udfs

• Admin scripts: collection of utilities for running diagnostics on your cluster
• Admin views: collection of utilities for managing your cluster, generating schema DDL, and so on
• Analyze/Vacuum utility: utility that can be scheduled to vacuum and analyze the tables within your Amazon Redshift cluster
• Column Encoding utility: utility that will apply optimal column encoding to an established schema with data already loaded

AWS Big Data Blog—Amazon Redshift

Amazon Redshift Engineering’s Advanced Table Design Playbook https://aws.amazon.com/blogs/big-data/amazon-redshift-engineerings-advanced-table- design-playbook-preamble-prerequisites-and-prioritization/ –Zach Christopherson

Top 10 Performance Tuning Techniques for Amazon Redshift https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-techniques-for-amazon- redshift/ –Ian Meyers and Zach Christopherson

Twelve Best Practices for Amazon Redshift Spectrum https://aws.amazon.com/blogs/big-data/10-best-practices-for-amazon-redshift-spectrum/ –Po Hong and Peter Dalton

Learn databases with AWS Training and Certification

Resources created by the experts at AWS to help you build and validate database skills

25+ free digital training courses cover topics and services related to databases, including:
• Amazon Aurora
• Amazon ElastiCache
• Amazon Neptune
• Amazon Redshift
• Amazon DocumentDB
• Amazon RDS
• Amazon DynamoDB

Validate expertise with the new AWS Certified Database—Specialty beta exam

Visit aws.training

Thank you!

Tony Gibbs, Amazon Web Services
Amy Tseng, Fannie Mae

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.