Data Warehouse Designs for Big Data Performance

Dave Beulke Dave Beulke and Associates

Platform: Cross Platform

E10 - Wednesday 16-October 2013 9:45-10:45
[email protected]

• Member of the inaugural IBM DB2 Information Champions
• One of 45 IBM DB2 Gold Consultants Worldwide
• Past President of International DB2 Users Group - IDUG
• Best speaker at CMG conference & former TDWI instructor
• Weekly Performance Tips: www.DaveBeulke.com

• Former Co-Author of certification tests
  • DB2 DBA Certification test
  • IBM certification test
• Former Columnist for IBM Data Management Magazine

Consulting / Teaching / Educational Seminars
• CPU Demand Reduction Guaranteed!
• DB2 Version 10 Transition
• DB2 Performance Review
• DB2 Performance for Java Developers
• DW & Design Review
• Data Warehousing Designs for Performance
• Security Audit & Compliance
• How to Do a Performance Review
• DB2 10 Migration Assistance
• Data Studio and pureQuery
• DB2 10 Performance IBM White Paper & Redbook

• Extensive experience in VLDB, DW design and performance
• Working with DB2 on z/OS since V1.2
• Working with DB2 on LUW since OS/2 Extended Edition
• Designed/implemented first in 1988 for E.F. Hutton
• Working with Java for Syspedia since 2001
• Syspedia - Find, understand and integrate your data faster!

Data Warehousing - New and old demands

• Direct Marketing: Analysis of the customer’s demographic, location, to tailor advertising and marketing campaigns to predict profit variability of campaign purchases
• Cross Selling: Analysis of customer’s purchases and behavior to predict their future products desired in a product category
• Customer Retention: Analysis of customer history, company interaction, and services performed to predict customer satisfaction and retention
• Customer Risk: Quantitative analytics to calculate the probabilities of various good and bad events and calculate their business profits/costs
• Health Treatments: Analysis of the different drug and physical treatments for conditions, illnesses and diseases against quality, costs, and outcomes
• Fraud Detection: Transaction analytics to calculate transaction fraud risk for non payment, stolen credit card, location dependencies etc…
• Financial Analytics: Market type, company category and financial statements analytics related to stock pricing, trends and profit probabilities and risk

© Copyright Dave Beulke & Associates [email protected] Page 3

Many Solution Architectures

• Fact and dimension tables

• Ralph Kimball: Centralize DW
  • One DW serving diverse needs and users

• Bill Inmon: Information Factory
  • Wheel Hub and Spoke concept – Central to Mart

• Hybrid Operational DW BI Solutions
  • Customized for your business

Data Warehousing – Rise of DW Machines

• Massively Parallel Machines
  • Grid
  • NoSQL like Hadoop, MongoDB, Greenplum etc…

• Free databases
  • Open source
  • Vendors redefining themselves

• Follow the money
  • Power/Electricity
  • Cooling
  • Copies of the data

Partitioning parallelism reduces time

Determine I/O requirements per year/month/week/day
• Formula = CPU ms + 2-20 ms per call
  • 5B per year = rows per week = 400,000,000 rows = 400,000 CPU seconds + 800,000-8,000,000 I/O secs
• Elapsed time 222 to 2,222 hours processing each week

• 10 parallel schedules
  • (2,222 / 4) / 10 = 55.55 hours
  • Now we have 100’s of CPU Cores available

• SQL Queries
  • Partitioning encourages query parallelism
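The elapsed-time figures above can be reproduced with a few lines of arithmetic. This is only a sketch: the 1 ms of CPU per row is an assumption inferred from "400,000,000 rows = 400,000 CPU seconds", the function and variable names are illustrative, and the division by 4 is copied from the slide, which does not state what it represents.

```python
def io_elapsed_hours(rows, io_ms_per_row):
    """Hours of serial I/O wait if every row costs io_ms_per_row milliseconds."""
    io_seconds = rows * io_ms_per_row / 1000
    return io_seconds / 3600

ROWS = 400_000_000      # row count used on the slide
CPU_MS_PER_ROW = 1      # assumption implied by the slide's 400,000 CPU seconds

cpu_seconds = ROWS * CPU_MS_PER_ROW / 1000
best_hours = io_elapsed_hours(ROWS, io_ms_per_row=2)    # fast I/O, 2 ms per call
worst_hours = io_elapsed_hours(ROWS, io_ms_per_row=20)  # slow I/O, 20 ms per call

# The slide's parallel figure: worst case divided by 4, then spread across
# 10 schedules that are assumed to split the work evenly.
parallel_hours = worst_hours / 4 / 10

print(f"CPU seconds:        {cpu_seconds:,.0f}")
print(f"Serial I/O hours:   {best_hours:,.0f} to {worst_hours:,.0f}")
print(f"With 10 schedules:  {parallel_hours:.2f} hours")
```

The point of the calculation is that I/O wait, not CPU, dominates the elapsed time, so every additional independent partition stream cuts the wall-clock time roughly in proportion.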

SQL Features
• OLAP functions
  • RANK
  • DENSE_RANK
  • ROW_NUMBER
• ORDER BY clause
• OVER clause
• PARTITION BY clause
• RANGE clause
• ROW clause
• ROLLUP
• CUBE
• Group By or Grouping Sets
• DB2 Cube Views
  • Virtual cube backed by real structures
• XML and usage

RANK Example:

    SELECT WORKDEPT,
           AVG(SALARY+BONUS) AS AVG_TOTAL_SALARY,
           RANK() OVER (ORDER BY AVG(SALARY+BONUS) DESC) AS RANK_AVG_SAL
    FROM BEULKE.EMPLOYEE
    GROUP BY WORKDEPT
    ORDER BY RANK_AVG_SAL

DENSE RANK Example:

    SELECT WORKDEPT, EMPNO, LASTNAME, FIRSTNME, EDLEVEL,
           DENSE_RANK() OVER (PARTITION BY WORKDEPT
                              ORDER BY EDLEVEL DESC) AS RANK_EDLEVEL
    FROM EMPLOYEE
    ORDER BY WORKDEPT, LASTNAME

ROW NUMBER Example:

    SELECT ROW_NUMBER() OVER (ORDER BY WORKDEPT, LASTNAME) AS NUMBER,
           LASTNAME, SALARY
    FROM EMPLOYEE
    ORDER BY WORKDEPT, LASTNAME
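The three numbering functions differ only in how they treat ties. The following plain-Python sketch mimics their semantics on an already ordered list; it is an illustration of the behavior, not how DB2 evaluates OLAP functions, and the sample EDLEVEL values are made up.

```python
def row_number(ordered):
    """ROW_NUMBER: a unique sequence number for every row, ties or not."""
    return list(range(1, len(ordered) + 1))

def rank(ordered):
    """RANK: tied rows share a rank, and the next rank skips ahead (gaps)."""
    ranks = []
    for i, value in enumerate(ordered):
        if i > 0 and value == ordered[i - 1]:
            ranks.append(ranks[-1])      # same rank as the tied predecessor
        else:
            ranks.append(i + 1)          # position in the ordering
    return ranks

def dense_rank(ordered):
    """DENSE_RANK: tied rows share a rank, and the next rank has no gap."""
    ranks = []
    for i, value in enumerate(ordered):
        if i == 0:
            ranks.append(1)
        elif value == ordered[i - 1]:
            ranks.append(ranks[-1])
        else:
            ranks.append(ranks[-1] + 1)
    return ranks

# Hypothetical EDLEVEL values, sorted as ORDER BY EDLEVEL DESC would sort them.
edlevels = sorted([18, 16, 16, 14, 20, 16], reverse=True)

for label, fn in (("ROW_NUMBER", row_number), ("RANK", rank), ("DENSE_RANK", dense_rank)):
    print(f"{label:<11}", fn(edlevels))
```

With three rows tied at 16, RANK jumps from 3 to 6 for the last row while DENSE_RANK continues with 4.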

New Index Opportunities
• Separate clustering and partitioning indexes
  • Clustering is not defined through partitioning index
  • Partitioning can be done in table definition DDL
  • PARTITION ENDING AT clause

• Cluster for biggest workload
  • Data load/inserts/maintenance
  • SQL activity usually ?10-25%? scanned
• Complement MQT aggregates
• Clustering for sort elimination
• Partition for parallelism/recovery

[Diagram labels: ORDER_DEPT_NBR_IX; CUST_ORDERS By Date]

Capture the environment characteristics
• Number of CPUs per LPAR available
  • Virtualization of CPUs VM Ware/Cloud

• Amount of LPAR memory available for workload
  • Amount of paging that is happening

• Number of disk drives
  • Amount of I/O to individual drives
  • Only get 30%-50% of optimum speed

Materialized Query Tables
• Available on z/OS and LUW
• Improved data refresh options
• Aggregate via multiple tables
• Design and aggregate for users

[Diagram: an MQT or View over the Fact-Yearly, Fact-1Q, Fact-Month, Fact-Week and Fact-Daily MQTs]

MQTs - Requirements & Options
• Find all totals, sum and SQL functions used in workload
  • Analyze base tables and their columns - NULLs
  • Analyze the types of functions used
  • SUM, AVG, temporary totals, Counts etc.

• Know data change frequency
  • SQL UPDATE, INSERT, DELETE
  • What is the schedule of change activity
  • Do you need a staging table

MQT Parameters - Optimization

• Setting the optimization level
  • Three different ways to achieve SQL optimization

• System level
  • DFT_QUERYOPT configuration parameter

• BIND level
  • QUERYOPT optimization-level bind option parameter

• Statement level – SQL statement
  • SET CURRENT QUERY OPTIMIZATION = host variable or number

MQT Parameters - Optimization
• Query rewrite considerations
  • MQT column definitions
  • Isolation Level
  • Special Registers
    • REFRESH AGE
    • CURRENT MAINTAINED TABLE TYPES FOR OPTIMIZATION
    • System, User, All, or None

• Only dynamic queries or during BIND
  • Query rewrite at the query block level
  • Columns, Predicates, IN-List, GROUP BY items, Derived columns
  • EXPLAIN: Table_Type = M

MQT Example
• Daily sales figures feed MQTs
• Create additive MQTs
• MQTs created for all analysis comparison points
• MQTs can be built from other MQTs
  • Define Quarterly Sales from Monthly Sales
• Combine MQTs through Views
  • Views over region, territory, store id etc…

[Diagram: OLTP Trans Detail from today feeds Daily Totals, Weekly Totals, Monthly Totals, Quarterly Totals and Y-T-D Sales, combined through a View]
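"Additive" is what makes it possible to build one aggregate from another. As a rough sketch in Python (the sales rows, the `rollup` helper and the key functions are all hypothetical), the aggregates below carry a sum and a count rather than an average, so quarterly totals derived from monthly totals match quarterly totals computed straight from the detail.

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily sales detail: (sale_date, amount).
daily_sales = [
    (date(2013, 1, 15), 100.0),
    (date(2013, 1, 15), 50.0),
    (date(2013, 2, 3), 75.0),
    (date(2013, 4, 9), 20.0),
    (date(2013, 5, 1), 30.0),
]

def rollup(rows, key_fn):
    """Aggregate (key, amount_sum, row_count) rows to a coarser key.

    SUM and COUNT are additive, so the output can feed another rollup.
    AVG is not additive; derive it as amount_sum / row_count at query time.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for key, amount_sum, row_count in rows:
        bucket = totals[key_fn(key)]
        bucket[0] += amount_sum
        bucket[1] += row_count
    return [(k, s, c) for k, (s, c) in sorted(totals.items())]

def to_month(d):
    return (d.year, d.month)

def month_to_quarter(ym):
    return (ym[0], (ym[1] - 1) // 3 + 1)

detail = [(d, amount, 1) for d, amount in daily_sales]

daily = rollup(detail, lambda d: d)              # plays the role of the daily MQT
monthly = rollup(daily, to_month)                # built from the daily aggregate
quarterly = rollup(monthly, month_to_quarter)    # built from the monthly aggregate

# Same answer as aggregating the detail directly, at a fraction of the rows read.
direct = rollup(detail, lambda d: month_to_quarter(to_month(d)))
print(quarterly == direct, quarterly)
```

Storing the count next to the sum is the design choice that keeps averages recoverable at every level of the hierarchy.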

MQT – 10 to 1000 times improvement!
• 5B rows per year – 10 per 4k page = ½B pages
• MQT aggregates save large amounts of everything
• Create aggregates for every possibility

• “On Demand” information
  • Sales by department
  • Sales by zip code
  • Sales by time period – day/week/month/quarter/AP
  • All reporting and analysis areas
• Trace usage to create/eliminate aggregates
• Total by month ½B I/Os versus 12 I/Os

[Diagram: Y-T-D View over the Fact-Yearly, Fact-1Q, Fact-Month, Fact-Week and Fact-Daily MQTs]
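The "½B I/Os versus 12 I/Os" comparison follows directly from the page arithmetic. A minimal sketch, assuming one I/O per detail page scanned and, at worst, one I/O per monthly aggregate row (both assumptions are simplifications, and the variable names are illustrative):

```python
ROWS_PER_YEAR = 5_000_000_000   # detail rows per year, from the slide
ROWS_PER_PAGE = 10              # rows per 4K page, from the slide
MONTHS = 12

# Scanning the detail: every page must be read once.
detail_pages = ROWS_PER_YEAR // ROWS_PER_PAGE

# Reading a monthly aggregate: twelve rows, assumed at most one I/O each.
aggregate_ios = MONTHS

print(f"Detail scan:       {detail_pages:,} page reads")
print(f"Monthly aggregate: {aggregate_ios} reads")
```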

Temporal Data Designs
• Example: Create a table, policy_info, that uses a BUSINESS_TIME period

    CREATE TABLE beulke.policy_info(
      policy_id CHAR(4) NOT NULL,
      coverage  INT     NOT NULL,
      bus_start DATE    NOT NULL,
      bus_end   DATE    NOT NULL,
      PERIOD BUSINESS_TIME(bus_start, bus_end));

• The grain for this temporal table is a DATE
  • Through the bus_start and bus_end definitions
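A BUSINESS_TIME period is inclusive of its start and exclusive of its end, which is what lets adjacent rows meet on the same date without overlapping. The Python sketch below imitates an "as of" lookup under that rule; the sample rows and the `coverage_as_of` helper are hypothetical and stand in for a query against policy_info.

```python
from datetime import date

# Hypothetical policy_info rows: (policy_id, coverage, bus_start, bus_end).
policy_info = [
    ("A123", 12000, date(2013, 1, 1), date(2013, 7, 1)),
    ("A123", 14000, date(2013, 7, 1), date(9999, 12, 30)),
]

def coverage_as_of(rows, policy_id, day):
    """Return the coverage in effect on `day`, or None if no row applies.

    The period test is bus_start <= day < bus_end: start inclusive,
    end exclusive, so a row ending on 1 July hands over to a row
    starting on 1 July without a gap or an overlap.
    """
    for pid, coverage, bus_start, bus_end in rows:
        if pid == policy_id and bus_start <= day < bus_end:
            return coverage
    return None

print(coverage_as_of(policy_info, "A123", date(2013, 6, 30)))
print(coverage_as_of(policy_info, "A123", date(2013, 7, 1)))
```

Because the grain is a DATE, two changes on the same day cannot both be represented in business time; a finer grain would need TIMESTAMP period columns.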

Bi-Temporal Data Designs

• Inserts data along with both time constraints
• UPDATE or DELETE can result in many more data rows
• Must consider two dimensions!

    CREATE TABLE beulke.policy_info(
      policy_id     CHAR(4) NOT NULL,
      coverage      INT     NOT NULL,
      -- BUSINESS_TIME columns
      bus_start     DATE    NOT NULL,
      bus_end       DATE    NOT NULL,
      -- SYSTEM_TIME columns
      sys_starttime TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW BEGIN,
      sys_endtime   TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW END,
      PERIOD BUSINESS_TIME(bus_start, bus_end),
      PERIOD SYSTEM_TIME(sys_starttime, sys_endtime));
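The warning that an UPDATE can produce many more rows comes from period splitting: changing only a portion of a row's business time leaves the untouched portions behind as separate rows. The sketch below imitates that splitting in Python for the business-time dimension only. The `update_for_portion` helper and its data are hypothetical, and in a bi-temporal table each changed row would additionally leave its prior version in system-time history.

```python
from datetime import date

def update_for_portion(rows, policy_id, new_coverage, from_day, to_day):
    """Apply new_coverage to [from_day, to_day) and split rows that straddle it.

    rows are (policy_id, coverage, bus_start, bus_end) with an exclusive end.
    """
    result = []
    for pid, coverage, start, end in rows:
        overlaps = pid == policy_id and start < to_day and end > from_day
        if not overlaps:
            result.append((pid, coverage, start, end))
            continue
        if start < from_day:                      # piece before the update keeps old value
            result.append((pid, coverage, start, from_day))
        result.append((pid, new_coverage, max(start, from_day), min(end, to_day)))
        if end > to_day:                          # piece after the update keeps old value
            result.append((pid, coverage, to_day, end))
    return result

rows = [("A123", 12000, date(2013, 1, 1), date(2014, 1, 1))]
updated = update_for_portion(rows, "A123", 15000, date(2013, 4, 1), date(2013, 7, 1))

for row in updated:
    print(row)
```

One row in, three rows out: the second-quarter change carves the original year-long row into a before, a during and an after.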

Temporal tables business_time
• Can contain status of the business on a certain day!

• Separate business time from system time
  • Just because you can have bi-temporal does not mean it fits your application
  • Sometimes fits for DW applications and transactional systems

• Complexity with the SQL & processes to make sure the business situation is valid
  • Can also use SQL for future business_time

Temporal Tables system_time

• SYSTEM_TIME relative to the world
  • Constant relative across all the processing

• Reflects the latest status of the processing
  • Transactions can be unique within the overall system
  • TIMESTAMP (12) WITHOUT OVERLAP definition
  • Granularity can be associated with any derivative of TIMESTAMP
  • DATE, DATE & TIME

• SQL processes can manipulate both SYSTEM_TIME and BUSINESS_TIME
  • Use single service or easily understood interfaces to these period columns

Hadoop
• Reliable, Scalable, distributed computing
• Apache Open Source Community by Doug Cutting

• Why and who uses it
  • Commodity hardware with petabytes of data
  • Google MapReduce and Google File System
  • Yahoo, IBM, Facebook, Amazon
  • Nothing special container/DASD placement

• Name node with many data nodes
  • Distributed file system over HTTP
  • Multiple copies of the data
  • Billions of data node files

Hadoop

• One Name node w/many data nodes
  • Petabytes of data
  • Typically 3 copies of data
  • 3 copies stored – 2 local & 1 remote

• Directory structure of all files
  • Works at file level

• Name Node works
  • Job Tracker and Task Tracker
  • MapReduce Engine - 2 steps
  • Map – chops up problem to worker node
  • Nodes sub-divide work further
  • Reduce brings all answers together

Distributed computing with Linux and Hadoop
http://www.ibm.com/developerworks/linux/library/l-hadoop/

Hadoop

• Hadoop is like Teradata?
  • Hadoop does not use B-trees or Hash partitioning

• Hadoop programming is primitive
  • MapReduce Matched pairs of data
  • Map (Key1, Value1) → List (Key2, Value2)
  • Reduce (Key2, List (Value2)) → List answer(Values)
  • Done on massive parallel scale

• Job tracker & Task Tracker are vital
  • Schedule & Restart processes

• Hadoop is version 0.21

Distributed computing with Linux and Hadoop
http://www.ibm.com/developerworks/linux/library/l-hadoop/
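The Map and Reduce signatures above can be made concrete with the classic word count. This is a single-process Python sketch of the programming model only. It does not use the Hadoop API, the function names are illustrative, and on a real cluster the map and reduce calls would run in parallel on many nodes with the framework handling the grouping step.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map (Key1, Value1) -> List (Key2, Value2): emit (word, 1) for every word."""
    return [(word.lower(), 1) for word in text.split()]

def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, counts):
    """Reduce (Key2, List (Value2)) -> answer: total occurrences of the word."""
    return word, sum(counts)

# Hypothetical input documents keyed by document id.
documents = {
    "d1": "big data big warehouse",
    "d2": "data warehouse design",
}

mapped = [pair for doc_id, text in documents.items() for pair in map_phase(doc_id, text)]
word_counts = dict(reduce_phase(word, counts) for word, counts in shuffle(mapped).items())

print(word_counts)
```

Each map call sees only its own document and each reduce call sees only one key, which is why the work can be spread across as many worker nodes as there are input splits and keys.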

NoSQL Access Considerations

• Pig, Hive, R considerations
  • SQL is no longer your access language
  • No more common access for cross platform
  • Programmer productivity
  • System integration difficulties

• NoSQL databases are mostly scanning systems
  • Scan all this data to find hash tags/friends/related
  • Map Reduce considerations

• SQL is the better interface
  • New everything?

DW through the Cloud
• Ability to use computing resources through Internet
  • No ownership of resources, tools, or infrastructure
  • Only can access the information process

• Cloud through ……
  • Amazon - Elastic Compute Cloud (EC2)
  • Simple Storage Service (S3)
  • IBM, HP, Microsoft, Google
• AKA
  • Platform as a Service
  • Software as a Service SaaS
• Why - Money!!!!

An example: The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4TB of raw image TIFF data (stored in S3) into 11 million finished PDFs in the space of 24 hours at a computation cost of about $240 (not including bandwidth).

Common Questions
• Separate along access patterns
  • Time period comparisons
  • Product
  • Location or Store

• Partition or separate table
  • Combine partitioning, multiple tables with UNION ALL view

[Diagram: Sales Total View over Fact-Furniture, Fact-Clothes, Fact-2010, Fact-2011, Fact-2012, Fact-USA and Fact-Europe]
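A UNION ALL view lets users query separated tables as if they were one. The sketch below shows the pattern with Python's built-in sqlite3 module purely so that it is runnable anywhere; SQLite stands in for DB2 here, and the table names, columns and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One fact table per year, mirroring the Fact-2010/2011/2012 split.
    CREATE TABLE fact_2010 (store TEXT, amount REAL);
    CREATE TABLE fact_2011 (store TEXT, amount REAL);
    CREATE TABLE fact_2012 (store TEXT, amount REAL);

    INSERT INTO fact_2010 VALUES ('USA', 100.0), ('Europe', 40.0);
    INSERT INTO fact_2011 VALUES ('USA', 120.0), ('Europe', 60.0);
    INSERT INTO fact_2012 VALUES ('USA', 150.0);

    -- The view presents the separate tables as a single sales total.
    CREATE VIEW sales_total AS
        SELECT 2010 AS sales_year, store, amount FROM fact_2010
        UNION ALL
        SELECT 2011, store, amount FROM fact_2011
        UNION ALL
        SELECT 2012, store, amount FROM fact_2012;
""")

# Time period comparison across all years through the one view.
totals = conn.execute(
    "SELECT sales_year, SUM(amount) FROM sales_total "
    "GROUP BY sales_year ORDER BY sales_year"
).fetchall()

print(totals)
conn.close()
```

Each yearly table can be loaded, reorganized or archived on its own schedule while the view keeps the user-facing name stable.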

Big Data Flow
• Design encourages concurrent loads and queries
  • Either through bulk, trickle or custom loads
  • Automatically aggregates data within hierarchy
  • Massively parallel processing of all workload aspects

Leverage daily processing aggregates

[Diagram: Fact-Daily MQT, Fact-Week MQT, Fact-Month MQT, Fact-1Q MQT, Fact-Yearly MQT]

zIIP/zAAP engine usage
• z/OS has Specialty CPUs
  • No software license costs
  • Significantly reduces z/OS costs
• zIIP – Java workloads
  • Java Batch as fast as COBOL
  • Web & App Server activity
  • DB2 Java connections
  • Utilities
• zAAP – UNIX workloads
  • LINUX workloads

[Diagram: G, zIIP and zAAP engines]

Daily processing and REFRESH
• Daily MQT only has 7 days of data
• Data is processed and refreshed within the Weekly MQT
• MQTs build from other MQTs
• The data cascades into the downstream MQTs

[Diagram: Claim_Month detail feeds the Fact-Daily MQT by process or REFRESH, which cascades through the Fact-Week, Fact-Month, Fact-1Q and Fact-Yearly MQTs on daily, weekly and monthly schedules]

MQTs series for each year

• MQTs for 2010: Fact-Daily, Fact-Week, Fact-Month, Fact-1Q, Fact-Yearly
• MQTs for 2011: Fact-Daily, Fact-Week, Fact-Month, Fact-1Q, Fact-Yearly
• MQTs for 2012: Fact-Daily, Fact-Week, Fact-Month, Fact-1Q, Fact-Yearly
• End-User View: 10 years, 22 Billion rows

BLU - Column Databases

MySQL Article Example: go to http://www.infinidb.org/downloads

Data Warehousing - New and old demands

• Direct Marketing: Rules based analytics, Association Models, Decision Trees, Neural, Naïve, Bayesian Networks to determine sales campaigns and tactics
• Cross Selling: Association Rules based are most common methods for cross selling to influence customer’s behavior and purchases
• Customer Retention: Associations, Rules, decision trees, logistic choice models, Regression based analysis of customer and their company history to predict customer satisfaction and retention
• Customer Risk: Classification through Neural, Naïve, Bayesian Networks, Modeling time series and exponential exploration to incorporate oddities for adverse situations
• Health Treatments: Rules based analytics, Association Models, Decision Trees, Neural, Naïve, Bayesian Networks to evaluate components of personnel, location, quality, costs, and outcomes
• Fraud Detection: Identify transaction anomalies and calculate transaction fraud risk for non payment, stolen credit card, location dependencies etc…
• Financial Analytics: Time based linear modeling to predict risk, value, stock price trends and profit probabilities

IBM - DB2 V 10.5 BLU

• New BLU Acceleration Technology from IBM

• Rethink your Big Data design points

• New Level of Compression

• Dramatic Data Skipping improvements in I/O

• Leveraging Microprocessor improvements

• More In-Memory capabilities

IBM DB2 BLU Version 10.5

• New Columnar “BLU” Data store
• Adaptive Compression + Compressed ordering
• Improved HADR
• Rethink Indexing
• SIMD – Leveraging Microprocessor Technology
• No more need for Complex SQL
• Easy Faster training or requirements
• Data Skipping
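Data skipping in general works by keeping a small summary (minimum and maximum) for each block of column values and reading only the blocks whose range could contain the value being searched for. The Python sketch below illustrates that general idea. It is not DB2 BLU's internal implementation, and the block size, helper names and data are all made up for the example.

```python
def build_synopsis(values, block_size):
    """Record (start_index, min, max) for each block of column values."""
    synopsis = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        synopsis.append((start, min(block), max(block)))
    return synopsis

def scan_equal(values, synopsis, block_size, target):
    """Find positions where value == target, skipping blocks that cannot match."""
    matches, blocks_read = [], 0
    for start, low, high in synopsis:
        if target < low or target > high:
            continue                      # whole block skipped: no I/O needed
        blocks_read += 1
        for offset, value in enumerate(values[start:start + block_size]):
            if value == target:
                matches.append(start + offset)
    return matches, blocks_read

# Hypothetical column of 100 values that arrive in increasing (time) order.
order_days = list(range(1, 101))
BLOCK_SIZE = 10

synopsis = build_synopsis(order_days, BLOCK_SIZE)
matches, blocks_read = scan_equal(order_days, synopsis, BLOCK_SIZE, target=42)

print(f"matches at positions {matches}, blocks read: {blocks_read} of {len(synopsis)}")
```

Skipping pays off most when the data arrives roughly ordered on the filtered column, as time-based loads usually do, because each block then covers a narrow range.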

Hybrid DW Architectures – Rethink Everything

• Share Centralized – transaction

• Share Nothing – distributed model

• Operational BI – specialized customize
  • Reduce the copies of the data
  • Remove restrictions for research
  • Customized configuration for computing power for analytics

• Specialized Processors one step further
  • Specialized customized for your business needs
  • Fraud Detection and Health Care analytics or ???

Hybrid DW Architecture

• Hybrid architecture for matching business needs

[Diagram: Cloud; IDAA; Customized or Selective Storage; Kepler; Directed Workloads; Server; Web; XML; Smart Operation DW; BI OLAP; Connections; Interface; IBM compatible. Star schema with Time Dimension (Year, Quarter, Month, Week, Day), Product Dimension (Product Group, Product), Store Dimension (Area, Region, Store) and Inventory Analysis (Inventory, Inventory/sales-ratio)]

Checklist for Performance
• DB2 SQL continues to lead the industry
  • Performance advantages of new SQL OLAP

• Temporal & MDC offer some opportunities

• Indexing continues to get better

• Partitioning & parallelism for performance

• MQTs offer HUGE opportunities
  • Must be substantial CPU and I/O savings
  • Can be used independently
  • Data refresh is convenient

Checklist for Performance – New Thinking
• Leverage new DB2 SQL Columnar Capabilities
  • Compression, In Memory, SIMD, Skipping
  • No more designs – Load and Go!

• No more indexes

• Partitioning & parallelism for performance

• Use UID frequency or Time based partitioning

• Include all filtering within SQL

• I/O is most important factor


Evaluate my session online: www.idug.org/eu2013/eval

Data Warehouse Designs for Big Data Performance Dave Beulke and Associates [email protected] BLOG: www.DaveBeulke.com E10-Wednesday 16-October 2013 9:45-10:45