VLDB - An Analysis of DB2 at Very Large Scale - D13

Austin Clifford IBM Session Code: 2130 Fri, May 18, 2012 (08:00 AM - 09:00 AM) | Platform: DB2 for LUW - II Presentation Objectives

1) Design & implementation of a VLDB. 2) Benefits and best practice use of DB2 Warehouse features. 3) Ingesting data into VLDB. 4) Approach & considerations to scaling out VLDB as the system grows. 5) Management and problem diagnosis of a VLDB. Disclaimer

●© Copyright IBM Corporation 2012. All rights reserved. ●U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

●THE INFORMATION CONTAINED IN THIS PRESENTATION IS PROVIDED FOR INFORMATIONAL PURPOSES ONLY. WHILE EFFORTS WERE MADE TO VERIFY THE COMPLETENESS AND ACCURACY OF THE INFORMATION CONTAINED IN THIS PRESENTATION, IT IS PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED. IN ADDITION, THIS INFORMATION IS BASED ON IBM’S CURRENT PRODUCT PLANS AND STRATEGY, WHICH ARE SUBJECT TO CHANGE BY IBM WITHOUT NOTICE. IBM SHALL NOT BE RESPONSIBLE FOR ANY DAMAGES ARISING OUT OF THE USE OF, OR OTHERWISE RELATED TO, THIS PRESENTATION OR ANY OTHER DOCUMENTATION. NOTHING CONTAINED IN THIS PRESENTATION IS INTENDED TO, NOR SHALL HAVE THE EFFECT OF, CREATING ANY WARRANTIES OR REPRESENTATIONS FROM IBM (OR ITS SUPPLIERS OR LICENSORS), OR ALTERING THE TERMS AND CONDITIONS OF ANY AGREEMENT OR LICENSE GOVERNING THE USE OF IBM PRODUCTS AND/OR SOFTWARE.

•IBM, the IBM logo, ibm.com, and DB2 are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml 4

What is a Very Large Database?

A very large database, or VLDB, is a database that contains an extremely high number of tuples (database rows), or occupies an extremely large physical filesystem storage space. The most common definition of VLDB is a database that occupies more than 1 terabyte. 5

Agenda ● VLDB Mission ● What is a PetaByte? ● Building a PetaByte System ● Shared Nothing Architecture ● Database Design ● Data Generation & ETL ● Workload & Testing ● Performance Monitoring ● Expanding the System ● Useful Tips for VLDB ● Conclusion 6

VLDB Mission ● Increasing demands from real-time analytics are placing additional pressure on warehouse systems...... ● Demonstrate the performance and scalability of DB2 and its complementary products at the Petabyte scale. ● Simulate heavy on-peak analytics in parallel with other essential system functions such as data ingest and backup and recovery. ● Guide best practices and future product direction. ● Develop techniques for massive scale rapid data generation. 7

Agenda ● VLDB Mission ● What is a PetaByte? ● Building a PetaByte System ● Shared Nothing Architecture ● Database Design ● Data Generation & ETL ● Workload & Testing ● Performance Monitoring ● Expanding the System ● Useful Tips for VLDB ● Conclusion 8

Digital Data 101 – What is a Petabyte?

● 1 Bit = Binary Digit ● 8 Bits = 1 Byte ● 1024 Bytes = 1 Kilobyte ● 1024 Kilobytes = 1 Megabyte ● 1024 Megabytes = 1 Gigabyte ● 1024 Gigabytes = 1 Terabyte ● 1024 Terabytes = 1 Petabyte ● 1024 Petabytes = 1 Exabyte ● 1024 Exabytes = 1 Zettabyte ● 1024 Zettabytes = 1 Yottabyte ● 1024 Yottabytes = 1 Brontobyte ● 1024 Brontobytes = 1 Geopbyte 9

Agenda ● VLDB Mission ● What is a PetaByte? ● Building a PetaByte System ● Shared Nothing Architecture ● Database Design ● Data Generation & ETL ● Workload & Testing ● Performance Monitoring ● Expanding the System ● Useful Tips for VLDB ● Conclusion 10

The Building Blocks We Start with the Storage:

1 disk = 450GB

1PB of DB Data = Raw Data + RAID + Contingency = 1.6PB

4,608 disks x 450GB = 1.6PB 11

The Building Blocks ● Disks get housed in EXP5000 enclosures ● EXP5000 can hold 16 disks

4,608 disks / 16 per enclosure = 288 EXP5000 enclosures

● EXP5000s need a DS5300 storage controller to manage the IO activity (1 DS5300 per 18 EXP5000s)

288 EXP5000 / 18 = 16 DS5300 12

The Building Blocks

●That's the storage done – now we need to drive the system with servers. ●16 clusters ●Smart Analytics guideline of 4 p550 Servers per cluster ●Each cluster attached to 1 DS5300 ●64 servers total

1 cluster = 4 x p550 servers 13

The Building Blocks ●The communication between devices ● Juniper Network switches for the copper networks ● IBM SAN switches for the fiber networks

●The server control for the 64 servers is managed by the HMC (Hardware Management Console) 14

Expansion Groups

Each expansion group (2 racks): 4 x P550 servers, 1 x DS5300, 18 x EXP5000 (6 + 12) 15

Hardware Summary Full VLDB deployment: ● Smart Analytics like configuration ● 64 p550 Servers ● 16 DS5300 Storage Controllers ● 288 EXP5000 Disk Enclosures ● 4,608 Disks (450GB each -> 1.6PB) ● 8 IBM SAN switches (24p/40p) ● 7 Juniper Network switches (48p) ● 2 HMCs ● 6KM of copper cables ● 2KM of fiber cables ● Occupies 33 fully loaded racks ● Latest ‘Free cooling” designs are incorporated into the lab 16

Free Cooling ● 6 CRAC (Computer Room Air Con) units in the VLDB lab

● Ireland's favourable (?) climate

● Significant savings for Computer room cooling ● As long as outside air temp is below 9.5 degrees C, 100% of the cooling of the room is by fresh air ● Over a full year, 80% of the cooling is fresh air provisioned 17

Software Stack

DB2 (Server 9.7 Fix Pack 5)

IBM General Parallel File System (GPFS™ ) 3.3.0.14

IBM Tivoli System Automation for Multi-Platforms 3.1.1.3

IBM AIX 6.1 TL6 SP5

IBM DS Storage Manager 10.60.G5.16. 18

VLDB in the flesh 19

Agenda ● VLDB Mission ● What is a PetaByte? ● Building a PetaByte System ● Shared Nothing Architecture ● Database Design ● Data Generation & ETL ● Workload & Testing ● Performance Monitoring ● Expanding the System ● Useful Tips for VLDB ● Conclusion 20

Shared Nothing Architecture

(Diagram: a query "select … from table" is processed in parallel by DB2 engines over the Fast Communication Manager; each partition has its own tables, data and log: Partition 1, Partition 2, Partition 3 … Partition n)

Database ● Partitioned Database Model ● Database is divided into 504 partitions ● Partitions run on 63 physical nodes (8 partitions per host) ● Each Partition Server has dedicated resources ● Parallel Processing occurs on all partitions: coordinated by DB2 ● Single system image to user and application 21

Shared Nothing Architecture ● Hash Partitioning ● Provides the best parallelism and maximizes I/O capability ● VLDB management (recovery, maintenance, etc.) ● Large scans automatically run in parallel... ● All nodes work together ● Truly scalable performance ● 504 partitions will complete the job in 1/504th of the time ● Queries and Utilities too (backup/restore, load, index build etc) 22

Mapping DB2 Partitions to Servers

(Diagram: partitions part0 and part1 on Node 1, part2 and part3 on Node 2, communicating via FCM)

db2nodes.cfg
• DB2 instance configuration file (sqllib/db2nodes.cfg)
• All databases in the instance share this definition
• File stored in the DB2 instance sqllib directory and shared to other nodes via GPFS
• Specifies the host name or the IP address of the high speed interconnect for FCM communication

# sqllib/db2nodes.cfg
0 node1 0
1 node1 1
2 node2 0
3 node2 1
...... 23

Agenda ● VLDB Mission ● What is a PetaByte? ● Building a PetaByte System ● Shared Nothing Architecture ● Database Design ● Data Generation & ETL ● Workload & Testing ● Performance Monitoring ● Expanding the System ● Useful Tips for VLDB ● Conclusion 24

Logical Design

● Star and snowflake schema
● Sampled production database artifacts
● Larger dimension tables are often snow-flaked
● Dimension tables have surrogate PKs
● Fact tables have composite PK or non-unique PK
● Dimension FKs are indexed
● VLDB: Relationships inferred
● All tables are compressed 25

Sample Schema

● Star schema for 4 largest fact tables 26

Partition Group Design ● Partition Groups ● Small dimension tables in Single Database Partition Group (SDPG). ● Fact and large dimension tables are partitioned.

PG_DISJOINT1 PG_DISJOINT2

● Collocation of Facts and largest/frequently joined dimension. ● VLDB - disjoint partition groups used to drive FCM, Table Queues harder. 27
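As a rough illustration of the partition group layout above (hypothetical object names, reusing the SDPG/PDPG naming that appears elsewhere in this deck): a single-partition group holds the small dimensions, a multi-partition group holds the partitioned fact and large dimension tables, and tablespaces are created within the appropriate group.

   CREATE DATABASE PARTITION GROUP sdpg ON DBPARTITIONNUM (0);
   CREATE DATABASE PARTITION GROUP pdpg ON DBPARTITIONNUMS (1 TO 503);

   CREATE TABLESPACE ts_dim  IN DATABASE PARTITION GROUP sdpg;
   CREATE TABLESPACE ts_fact IN DATABASE PARTITION GROUP pdpg;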

Choosing the Partitioning Key ● Partitioning key ● A subset of the primary key ● DISTRIBUTE BY HASH ● Fewer columns is better ● Surrogate key with high cardinality is ideal ● Collocation ● Possible for tables with same partitioning key ● Collocate Fact with largest commonly joined dimension table ● Consider adding redundant column to Fact PK ● Replicate other dimension tables ● Trade-off between partition balancing and optimal collocation

● Skew ● Avoid skew of more than 10% ● Avoid straggler partition. 28
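A minimal sketch of collocation through a shared partitioning key (hypothetical tables and columns, not the actual VLDB schema): both tables hash on the same surrogate key, so joins on that key are resolved locally on each partition.

   CREATE TABLE bi_schema.tb_customer_dim (
       customer_id   BIGINT NOT NULL PRIMARY KEY,
       customer_name VARCHAR(100)
   ) IN ts_fact
     DISTRIBUTE BY HASH (customer_id);

   CREATE TABLE bi_schema.tb_sales_fact (
       customer_id  BIGINT  NOT NULL,
       date_id      INTEGER NOT NULL,
       sales_amount DECIMAL(15,2)
   ) IN ts_fact
     DISTRIBUTE BY HASH (customer_id);    -- same key, so the join is collocated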

Check Skew

-- rows per partition
SELECT dbpartitionnum(date_id) AS "Partition number",
       count(1)*10 AS "Total # records"
FROM bi_schema.tb_sales_fact TABLESAMPLE SYSTEM 10
GROUP BY dbpartitionnum(date_id)

Partition number    Total # records
----------------    ---------------
               1         10,313,750
               2         10,126,900
               3          9,984,910
               4         10,215,840

-- Space allocation per partition
SELECT DBPARTITIONNUM, SUM(DATA_OBJECT_L_SIZE) SIZE_KB
FROM SYSIBMADM.ADMINTABINFO
WHERE (tabschema, tabname) = ('THESCHEMA', 'THETABLE')
GROUP BY ROLLUP( DBPARTITIONNUM )
ORDER BY 2; 29

Physical Design ● Separate tablespaces for: ● Staging Tables ● Indexes ● MQTs ● Table data ● VLDB - typically larger tables have larger pagesize ● Range Partitioning ● Most Fact and large dimension tables are RP. Partitioned indexes. ● Facilitates roll-out/archiving, tablespace backups etc. ● > 200 ranges - consider MDC instead...

(Diagram: table partitions 2012Q1, 2011Q4, … 2006Q3 mapped to Table Space 14, Table Space 13, … Table Space 1 under Automatic Storage) 30
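A minimal sketch of the range partitioning layout above (hypothetical table and column names): quarterly ranges over roughly the period shown in the diagram, with a partitioned index so that individual quarters can later be detached for roll-out or archiving.

   CREATE TABLE bi_schema.tb_orders_fact (
       order_date  DATE   NOT NULL,
       customer_id BIGINT NOT NULL,
       amount      DECIMAL(15,2)
   )
   DISTRIBUTE BY HASH (customer_id)
   PARTITION BY RANGE (order_date)
      (STARTING '2006-07-01' ENDING '2012-03-31' EVERY 3 MONTHS);

   CREATE INDEX bi_schema.ix_orders_cust
       ON bi_schema.tb_orders_fact (customer_id) PARTITIONED;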

MDC and MQT ● Multi-dimensional Clustering ● Why? ● Better query performance through data clustering ● Avoids the need for REORG ● How to choose the dimension columns? ● Columns in filter conditions ● Use combinations with many rows per distinct value

● Materialized Query Tables ● Why? ● Pre-compute costly aggregations ● Replicated tables for non-collocated dimension ● Layering of MQT for dimensional hierarchies ● How to implement successfully? ● Aim for > 10 fold reduction compared to fact table ● Keep base table and MQT stats up to-date ● Use range partitioning for the Fact & MQTs 31
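The sketch below (hypothetical names) combines both ideas from this slide: an MDC fact table clustered on a month column coarsified with the monotonic INTEGER() function, and a deferred-refresh aggregate MQT distributed on the same key.

   CREATE TABLE bi_schema.tb_sales_mdc (
       sale_date  DATE    NOT NULL,
       store_id   INTEGER NOT NULL,
       amount     DECIMAL(15,2),
       sale_month INTEGER GENERATED ALWAYS AS (INTEGER(sale_date) / 100)
   )
   DISTRIBUTE BY HASH (store_id)
   ORGANIZE BY DIMENSIONS (sale_month, store_id);

   CREATE TABLE bi_schema.mqt_sales_by_month AS
      (SELECT sale_month, store_id, SUM(amount) AS total_amount
       FROM bi_schema.tb_sales_mdc
       GROUP BY sale_month, store_id)
      DATA INITIALLY DEFERRED REFRESH DEFERRED
      DISTRIBUTE BY HASH (store_id);

   REFRESH TABLE bi_schema.mqt_sales_by_month;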

Agenda ● VLDB Mission ● What is a PetaByte? ● Building a PetaByte System ● Shared Nothing Architecture ● Database Design ● Data Generation & ETL ● Workload & Testing ● Performance Monitoring ● Expanding the System ● Useful Tips for VLDB ● Conclusion 32

Intelligent Data Generation

● Workloads and schema ● 574 Tables ● 7,500 complex SQL statements ● Representative of a cross-section of real production data warehouses

● Synthetic data ● Referential integrity determined from SQL joins ● Valid result sets for the queries ● Data generated using prime sequences to prevent primary key collisions (patent filed) 33

ETL ● Requirement ● Identify a high speed tool to scale-up the initial base data-set in parallel on each host in isolation. ● Avoid bottlenecks that could impede scalability e.g. network bandwidth. ● Ensure ETL scales-out linearly

Extract → Transform → LOAD → DB2

● Implementation on VLDB ● High Performance Unload → PIPE → LOAD ● HPU facilitates pseudo SQL for implementing key transformations ● Parallel feature “ON HOST” used, repartitioning TARGET KEYS. 34
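A rough sketch of the pipe-fed LOAD for one partition (hypothetical paths, instance and table names; the HPU control file itself is not shown, and the partitioned LOAD options are spelled per the standard LOAD syntax rather than copied from the actual VLDB scripts):

   mkfifo /tmp/pipe.001
   db2hpu -i db2inst1 -f VLDBcontrolfile      (unloads, transforms and writes into the pipe)
   db2 "LOAD FROM /tmp/pipe.001 OF DEL
        REPLACE INTO bi_schema.tb_sales_fact
        NONRECOVERABLE DATA BUFFER 4096
        PARTITIONED DB CONFIG MODE LOAD_ONLY OUTPUT_DBPARTNUMS (1)"

One such pipe and LOAD pair runs for every local partition on every host, which is what keeps the scale-up traffic off the network.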

Scale-up

● Data Generation and load of 1 petabyte of compressed intelligent data to VLDB in 31 days
● Linear out-scaling

(Chart: TeraBytes per day ingested vs. number of physical nodes, 4 to 64; throughput scales linearly with node count) 35

Ingesting Data ● LOAD ● Fast - Writes formatted DB2 pages directly into the database table with minimal logging. ● Table is not fully available – read access possible. ● Processing is done at the DB2 server machine. ● Specify the DATA BUFFER parameter. ● Import ● Slow - buffered/array partition level inserts are faster. ● DB2 v10 – Ingest ● Continuous, SQL arrays, tables fully available, client based. ● Lightweight ETL (SQL expressions including predicates & casting). ● Can be used against v9.5 and v9.7 databases

(Diagram: INGEST pipeline with Transport, Format and Flush phases) 36
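A minimal sketch of the DB2 v10 INGEST command described above, with a hypothetical delimited input file and target table:

   db2 "INGEST FROM FILE /staging/sales_20120518.del
        FORMAT DELIMITED
        INSERT INTO bi_schema.tb_sales_fact"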

The Big Four

● 574 Tables in total ● > 90% of the total raw data is contained in 4 large fact tables ● The four big fact tables and their associated dimensions (chart: compressed database size vs. Total Raw Data equivalent) 38

Agenda ● VLDB Mission ● What is a PetaByte? ● Building a PetaByte System ● Shared Nothing Architecture ● Database Design ● Data Generation & ETL ● Workload & Testing ● Performance Monitoring ● Expanding the System ● Useful Tips for VLDB ● Conclusion 39

Conducting Workload

● Utilities ● High Concurrency ● Query ● WLM ● IUD ● Monitoring ● Warehouse Features ● ETL ● BAR 40

What else is being tested?

● High Availability ● Scalability ● Limit Conditions ● Admin. Functions ● System Expansion & Redistribution ● Execution Plan Stability ● 128, 504 and 1000 partitions ● Problem Determination ● Optim ● Abnormal Termination ● Access Plan Stability 41

Agenda ● VLDB Mission ● What is a PetaByte? ● Building a PetaByte System ● Shared Nothing Architecture ● Database Design ● Data Generation & ETL ● Workload & Testing ● Performance Monitoring ● Expanding the System ● Useful Tips for VLDB ● Conclusion 42

DB2 Performance Monitoring

Monitoring table functions ● Lower overhead than older snapshot based functions ● MON_GET_UNIT_OF_WORK – monitor long running queries ● MON_GET_CONNECTION – aggregated measures for connected applications. Useful for checking locks, FCM usage ● On VLDB use the MON_GET_MEMORY_POOL function to track instance memory (FCMBP, Utility, Bufferpool, Sortheap) ● MON_GET_TABLE useful for identifying TABLE_SCANS

Db2top ● Quick interactive view ● Obtain data for single partition db2top -P

Optim Performance Manager (OPM) ● Sophisticated graphical browser based monitoring ● Facilitates “in-flight” and retrospective analysis ● Leverages monitoring table functions 43
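For example, a query along the following lines (column names assumed from the documented MON_GET_UNIT_OF_WORK output, not taken from the VLDB scripts) lists the most CPU-hungry in-flight units of work across all partitions; -2 requests data from all members:

   SELECT application_handle, uow_id,
          total_rqst_time, total_cpu_time
   FROM TABLE(MON_GET_UNIT_OF_WORK(NULL, -2)) AS t
   ORDER BY total_cpu_time DESC
   FETCH FIRST 10 ROWS ONLY;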

Agenda ● VLDB Mission ● What is a PetaByte? ● Building a PetaByte System ● Shared Nothing Architecture ● Database Design ● Data Generation & ETL ● Workload & Testing ● Performance Monitoring ● Expanding the System ● Useful Tips for VLDB ● Conclusion 44

Expanding the System ● Add additional nodes/hosts to provide additional capacity. ● Mixed generation systems possible ● Must support same OS and s/w level. ● Plan ahead ● Plan to expand when capacity has exceeded 80% ● Use the REDISTRIBUTE command ● REDISTRIBUTE DATABASE PARTITION GROUP PDPG UNIFORM NOT ROLLFORWARD RECOVERABLE DATA BUFFER 300000 PRECHECK ONLY ● Ensure enough disk space to rebuild indexes on the largest table. ● Consider using INDEXING MODE DEFERRED ● Extensive testing on VLDB ● System expanded in phases. ● System reduction also included, as migration of systems to a more powerful next generation platform may require fewer host machines. 45
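A rough outline of one expansion step (hypothetical host name and partition numbers; automatic storage is assumed to supply containers on the new partitions): start new logical partitions on the added host, extend the partition group, then redistribute.

   db2start DBPARTITIONNUM 504 ADD DBPARTITIONNUM HOSTNAME newhost01 PORT 0
   db2start DBPARTITIONNUM 505 ADD DBPARTITIONNUM HOSTNAME newhost01 PORT 1

   db2 "ALTER DATABASE PARTITION GROUP pdpg ADD DBPARTITIONNUMS (504, 505)"

   db2 "REDISTRIBUTE DATABASE PARTITION GROUP pdpg UNIFORM
        NOT ROLLFORWARD RECOVERABLE DATA BUFFER 300000"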

Agenda ● VLDB Mission ● What is a PetaByte? ● Building a PetaByte System ● Shared Nothing Architecture ● Database Design ● Data Generation & ETL ● Workload & Testing ● Performance Monitoring ● Expanding the System ● Useful Tips for VLDB ● Conclusion 46

VLDB Tips ● Consider configuring AUTO-RUNSTATS ● Ensures statistics are up to date ● Create statistics profile. RUNSTATS SET PROFILE ● Include STATISTICS USE PROFILE with LOAD ● Collect distribution statistics in columns where there is skew and predicates use constants ● Collect column group statistics with multiple predicates on the same table e.g. WHERE country = 'IRELAND' AND city = 'DUBLIN' ● Use sampling for very large tables ● Runstats on table scm.tab on key columns with distribution on key columns tablesample system(1) ● Do not configure AUTO-REORG ● When tables contain billions or trillions of rows, it is important to control the maintenance aspect of REORG ● Use MDC to reduce requirement to reorganize large tables. 47
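For instance, a sampled statistics profile can be registered once and then reused by LOAD (hypothetical table; RUNSTATS options as sketched in the bullets above, and LOAD REPLACE shown because LOAD collects statistics only in REPLACE mode):

   db2 "RUNSTATS ON TABLE scm.tab
        ON KEY COLUMNS WITH DISTRIBUTION ON KEY COLUMNS
        TABLESAMPLE SYSTEM(1)
        SET PROFILE"

   db2 "LOAD FROM /staging/tab.del OF DEL
        REPLACE INTO scm.tab
        STATISTICS USE PROFILE NONRECOVERABLE"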

VLDB Tips

● Use ssh for instance remote shell ● DB2SET DB2RSHCMD=/bin/ssh ● Particularly important when more than 200 partitions (rsh limit) ● Use multiple coordinators ● Spread client connections across partitions ● Prevents over-committing resources on any one host e.g. application heap, CPU. ● Use connection concentrator for large number of applications ● Use MAX_CONNECTIONS > MAX_COORDAGENTS (fixed) ● Use explicit activation ● db2 activate db myDB ● Use split diagnostics directories ● Avoid contention on single diagnostics log ● Use db2diag -global -merge to merge. 48
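A brief sketch of two of the settings above; the split DIAGPATH token form ($h$n for a per-host, per-partition directory) is quoted from memory and should be verified against the documentation for the installed level:

   db2set DB2RSHCMD=/bin/ssh
   db2 update dbm cfg using DIAGPATH '"/db2/db2inst1/db2dump/ $h$n"'
   db2 activate db myDB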

VLDB Tips ● Compression ● Enable compression if system is I/O bound (IO Waits) ● Do not enable compression if the system is CPU bound. ● Estimate compression ratios using the administration function ADMIN_GET_TAB_COMPRESS_INFO_V97 ● Backup compression – don't double up ● DB2 V10 Adaptive Compression ● Understand instance memory allocation ● Avoid over-committing memory – can lead to paging ● Particularly important with High Availability ● Monitor with db2mtrk -i -d -v ● Can control with INSTANCE_MEMORY and DATABASE_MEMORY – not always appropriate in partitioned environment ● Materialized Query Tables (MQT) ● For large replicated tables on VLDB use a partitioned MQT to distribute the table replication across all partitions.... 49
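For example, compression savings can be estimated per database partition before enabling row compression (hypothetical schema and table; column names assumed from the documented output of the admin function):

   SELECT dbpartitionnum, pages_saved_percent
   FROM TABLE(SYSPROC.ADMIN_GET_TAB_COMPRESS_INFO_V97(
              'THESCHEMA', 'THETABLE', 'ESTIMATE')) AS t;

   -- if the estimate justifies it and the system is I/O bound:
   ALTER TABLE theschema.thetable COMPRESS YES;

A subsequent REORG (or enough new inserts to trigger automatic dictionary creation) is then needed before existing rows are actually stored compressed.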

Refresh Large Replicated MQT

● Base Table is 1 GB = 10m rows (108 byte rowsize) ● Admin NIC can handle 250MB/s ● REFRESH operation completes in 66 secs

Sending 16 X 1GB in total @ 250MB/s (125MB/s to each of the two other hosts)

Note: durations calculated here are based purely on physical network constraints and assume other resources are unlimited – actual duration will depend on other system specific variables including disk I/O, CPU clock speed etc.

(Diagram: the BASE table on Host 1 streams to the replicated copies (REPL) on every partition of Host 2 and Host 3) 50

Expand System

● Base Table is 1 GB = 10m rows (108 byte rowsize) ● Admin NIC can handle 250MB/s ● REFRESH operation completes in 11 mins

Sending 160 X 1GB in total @ 250MB/s (12.5MB/s from Host 1 to each of the 20 data hosts)

(Diagram: the BASE table on Host 1 streams to the replicated copies (REPL) on every partition of Host 2 through Host 21)

51

Introduce a Distributed MQT
● Now REFRESH of the replicated table is re-written as a query on the distributed MQT instead of the base table on the catalog partition.
● CREATE TABLE distrib_mqt AS (SELECT * FROM basetab) DATA INITIALLY DEFERRED REFRESH IMMEDIATE ENABLE QUERY OPTIMIZATION DISTRIBUTE BY HASH(col1) IN TS_BIG
● Each partition now sends approximately 1/160th of the table to each of the other partitions.
● So approximately 1024MB * 8 * 19/20 to be transmitted & received by each host
● REFRESH operation now completes in approximately 1 min.

(Diagram: each partition's slice of the distributed MQT (1/160th) feeds the replicated copies (REPL) on Host 2 through Host 21) 52
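Continuing the sketch above with the same hypothetical names: the replicated MQT itself remains defined on the base table and, per the slide, its REFRESH is re-written by the optimizer to read from the distributed MQT rather than from the single copy on the catalog partition.

   CREATE TABLE repl_dim AS (SELECT * FROM basetab)
      DATA INITIALLY DEFERRED REFRESH DEFERRED
      DISTRIBUTE BY REPLICATION IN TS_BIG;

   REFRESH TABLE distrib_mqt;   -- initial population of the distributed MQT
   REFRESH TABLE repl_dim;      -- now fed from all partitions in parallel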

Agenda ● VLDB Mission ● What is a PetaByte? ● Building a PetaByte System ● Shared Nothing Architecture ● Database Design ● Data Generation & ETL ● Workload & Testing ● Performance Monitoring ● Expanding the System ● Useful Tips for VLDB ● Conclusion 53

Conclusion

VLDB works great on DB2!

DB2 is designed and architected to grow and perform from single partition to 1000 partitions, providing plenty of growth as your business needs require. Austin Clifford IBM [email protected] Session VLDB - An Analysis of DB2 at Very Large Scale - D13 55

Backup Slides 56

Scaleup Implementation

NO TRAFFIC BETWEEN SERVERS

DataServer1 (logical nodes 1, 2, 3):
Db2hpu -i instance -f VLDBcontrolfile
- Unload data from containers for local nodes
- update key columns
- pass data through pipes for LOAD
pipe.001 pipe.002 pipe.003
Load from pipe .. partitioned db config mode outputdbnums(1,2,3)
Load data back to the containers

DataServer2 (logical nodes 4, 5, 6):
Db2hpu -i instance -f VLDBcontrolfile
- Unload data from containers for local nodes
- update key columns
- pass data through pipes for LOAD
pipe.004 pipe.005 pipe.006
Load from pipe .. partitioned db config mode outputdbnums(4,5,6)
Load data back to the containers 57

Prime Sequences

● Prevent key collisions ● Duplicates are very costly during load. ● Avoiding PK collisions essential. ● Nested sequences are unique, but result in skewed values. ● => use cycling sequences

Nested Sequences Cycling Sequences 58

Prime Sequences

● Problem. ● Cycling sequences can hit collision before full cartesian product if constituent columns share common factor.....

● Solution ● Use sequences with prime cardinality.... 59

Prime Sequences

● Easy algorithm with no need for counters etc ● Just need the ranges for the columns and the row number to determine the key values

MOD(N - 1, R) + 1, where N = Row Number and R = Range (Cardinality)

● Example: ● Col1 has a range of 2 values ● Col2 has a range of 3 values ● Col3 has a range of 5 values ● Full cartesian product would contain 30 rows 60

Prime Sequences

● Easy algorithm with no need for counters etc ● Just need the ranges for the columns and the row number to determine the key values ● Example for Col1: MOD(22 - 1, 2) + 1 = 2, where 22 = Row Number and 2 = Range (Cardinality)

● Example: ● Col1 has a range of 2 values ● Col2 has a range of 3 values ● Col3 has a range of 5 values ● Full cartesian product would contain 30 rows 61
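As a small worked sketch of the formula in SQL (a hypothetical generator query, not the actual tooling): because the ranges 2, 3 and 5 are pairwise co-prime, the generated (col1, col2, col3) combinations do not repeat before the full 30-row cartesian product is reached.

   WITH nums(n) AS (
        SELECT 1 FROM sysibm.sysdummy1
        UNION ALL
        SELECT n + 1 FROM nums WHERE n < 30
   )
   SELECT n,
          MOD(n - 1, 2) + 1 AS col1,
          MOD(n - 1, 3) + 1 AS col2,
          MOD(n - 1, 5) + 1 AS col3
   FROM nums
   ORDER BY n;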

Prime Sequences

Unique Primes 62

Scaleup Fact Table

● Generate a base set of data and then “Scale Up” the rest
● Transpose an existing piece of data into a new piece of data for the scaleup
● Facts and Dimensions ● Facts are range partitioned into 100 parts ● Populate part 0 for each and then scaleup to fill the remaining 99

(Diagram: PART 0 is scaled up to populate PART 1, PART 2, PART 3, …) 63

Scaleup

Transform formula: MOD((L + V – 1), R) + 1

(Diagram, slides 64–66: sample key values are extracted from the existing rows, transposed with the formula above, and loaded into the new range) 66

Performance Monitoring

● System performance ● CPU ● Vmstat, nmon ● System CPU should be <= 25% of User CPU ● I/O waits < 25% ● RunQueue more representative ● I/O ● Disk: iostat ● Network (FCM): netstat, entstat ● Memory ● Svmon, vmstat

VLDB - An Analysis of DB2 at Very Large Scale - D13

Austin Clifford IBM Session Code: 2130 Fri, May 18, 2012 (08:00 AM - 09:00 AM) | Platform: DB2 for LUW - II

Abstract: The Very Large Database project is an exciting and unprecedented initiative to verify the performance and scalability of DB2 and its complementary products at the very large scale. The trend towards real-time analytics is placing increasing demands on data warehouse systems. The investigations by the team in Dublin include simulating heavy on-peak analytics in parallel with other essential system functions such as data ingest and backup and recovery. In order to achieve a database of this magnitude, the team have developed and patented innovative techniques for rapid population of customer-like data. Valuable insights are being learned and these will feed into product design and best practice recommendations, to ensure that DB2 continues to outpace future customer needs. This presentation will take us through these insights and highlight the key considerations to ensure a successful large scale data warehouse solution.

1

2

Presentation Objectives

1) Design & implementation of a VLDB. 2) Benefits and best practice use of DB2 Warehouse features. 3) Ingesting data into VLDB. 4) Approach & considerations to scaling out VLDB as the system grows. 5) Management and problem diagnosis of a VLDB.

3

Disclaimer

●© Copyright IBM Corporation 2012. All rights reserved. ●U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

●THE INFORMATION CONTAINED IN THIS PRESENTATION IS PROVIDED FOR INFORMATIONAL PURPOSES ONLY. WHILE EFFORTS WERE MADE TO VERIFY THE COMPLETENESS AND ACCURACY OF THE INFORMATION CONTAINED IN THIS PRESENTATION, IT IS PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED. IN ADDITION, THIS INFORMATION IS BASED ON IBM’S CURRENT PRODUCT PLANS AND STRATEGY, WHICH ARE SUBJECT TO CHANGE BY IBM WITHOUT NOTICE. IBM SHALL NOT BE RESPONSIBLE FOR ANY DAMAGES ARISING OUT OF THE USE OF, OR OTHERWISE RELATED TO, THIS PRESENTATION OR ANY OTHER DOCUMENTATION. NOTHING CONTAINED IN THIS PRESENTATION IS INTENDED TO, NOR SHALL HAVE THE EFFECT OF, CREATING ANY WARRANTIES OR REPRESENTATIONS FROM IBM (OR ITS SUPPLIERS OR LICENSORS), OR ALTERING THE TERMS AND CONDITIONS OF ANY AGREEMENT OR LICENSE GOVERNING THE USE OF IBM PRODUCTS AND/OR SOFTWARE.

•IBM, the IBM logo, ibm.com, and DB2 are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml

4 4

What is a Very Large Database?

A very large database, or VLDB, is a database that contains an extremely high number of tuples (database rows), or occupies an extremely large physical filesystem storage space. The most common definition of VLDB is a database that occupies more than 1 terabyte.

Speaker Bio: Austin is a DB2 Data Warehouse QA Specialist in Dublin Information Management. Prior to joining IBM in 2009 Austin worked as a database consultant in the Banking sector and has 15 years industry experience in Data Modelling, Database Design, Database Administration and design of ETL applications. Austin holds degrees in Engineering and Management Science from University College Dublin.

Austin has been the technical lead on the VLDB project since 2010. He works closely with DB2 Best Practices and is a Customer Lab Advocate.

4

5 5

Agenda ● VLDB Mission ● What is a PetaByte? ● Building a PetaByte System ● Shared Nothing Architecture ● Database Design ● Data Generation & ETL ● Workload & Testing ● Performance Monitoring ● Expanding the System ● Useful Tips for VLDB ● Conclusion

This presentation will walk us through the Very Large Database project from inception to the present time.

We'll look at how we built a system of this unprecedented scale from literally the ground up. We'll look at the hardware building blocks, including the vast amount of storage to accommodate a database of this magnitude.

We'll do a brief review of the shared nothing architecture and how this is implemented on a DB2 Partitioned Database..

Next up, we'll turn our attention to the design of the large scale database. We'll look at warehouse best practice and where we intentionally departed from this, for example using disjoint partition groups.

Once we've got a sound foundation (database design), we need to generate and load the vast amount of data. We'll look at the algorithms we developed to synthesize intelligent data based on real-life data warehouse artifacts.

Then once we have a fully built and populated system we can drive the complex mixed workload which is representative of the demands on production data warehouses.

Then we'll look at how we monitor the performance of a VLDB and then how to plan and expand the system for future growth.

Finally, we'll look at the insights we gained and useful tips for administering databases of this scale.

5

6 6

VLDB Mission ● Increasing demands from real-time analytics are placing additional pressure on warehouse systems...... ● Demonstrate the performance and scalability of DB2 and its complementary products at the Petabyte scale. ● Simulate heavy on-peak analytics in parallel with other essential system functions such as data ingest and backup and recovery. ● Guide best practices and future product direction. ● Develop techniques for massive scale rapid data generation.

In Jan 2010, IBM began the process of building a VLDB, according to the mission established in 2009 by Enzo Cialini, STSM, Chief Architect in DB2 SVT.

To achieve a system of this unprecedented scale, it was necessary to build a brand new lab to house the system in the Dublin campus. The system infrastructure build started in January 2010 and the database creation and data generation commenced in June 2010.

6

7 7

Agenda ● VLDB Mission ● What is a PetaByte? ● Building a PetaByte System ● Shared Nothing Architecture ● Database Design ● Data Generation & ETL ● Workload & Testing ● Performance Monitoring ● Expanding the System ● Useful Tips for VLDB ● Conclusion

7

8 8

Digital Data 101 – What is a Petabyte?

● 1 Bit = Binary Digit ● 8 Bits = 1 Byte ● 1024 Bytes = 1 Kilobyte ● 1024 Kilobytes = 1 Megabyte ● 1024 Megabytes = 1 Gigabyte ● 1024 Gigabytes = 1 Terabyte ● 1024 Terabytes = 1 Petabyte ● 1024 Petabytes = 1 Exabyte ● 1024 Exabytes = 1 Zettabyte ● 1024 Zettabytes = 1 Yottabyte ● 1024 Yottabytes = 1 Brontobyte ● 1024 Brontobytes = 1 Geopbyte

VLDB is over 1 PB compressed which is equivalent to several PBs of raw data.

It is currently the biggest DB2 system worldwide for Linux, Unix and Windows.

8

9 9

Agenda ● VLDB Mission ● What is a PetaByte? ● Building a PetaByte System ● Shared Nothing Architecture ● Database Design ● Data Generation & ETL ● Workload & Testing ● Performance Monitoring ● Expanding the System ● Useful Tips for VLDB ● Conclusion

We'll walk through the next 10 slides very quickly...

9

10 10

The Building Blocks We Start with the Storage:

1 disk = 450GB

1PB of DB Data = Raw Data + RAID + Contingency = 1.6PB

4,608 disks x 450GB = 1.6PB

4,608 spindles!!... RAID-5.... hot spares... during peak workload we can have several disk failures in a week, but the RAID configuration has provided 100% protection against data loss

10

11 11

The Building Blocks ● Disks get housed in EXP5000 enclosures ● EXP5000 can hold 16 disks

4,608 disks / 16 per enclosure = 288 EXP5000 enclosures

● EXP5000s need a DS5300 storage controller to manage the IO activity (1 DS5300 per 18 EXP5000s)

288 EXP5000 / 18 = 16 DS5300

288 EXP5000 enclosures managed by 16 DS5300 SAN controllers

11

12 12

The Building Blocks

●That's the storage done – now we need to drive the system with servers. ●16 clusters ●Smart Analytics guideline of 4 p550 Servers per cluster ●Each cluster attached to 1 DS5300 ●64 servers total

= 4 x

64 p550 servers in Smart Analytics 7600 configuration.

1 management, 1 admin, 62 data nodes.

Each configured with 64GB physical memory.

12

13 13

The Building Blocks ●The communication between devices ● Juniper Network switches for the copper networks ● IBM SAN switches for the fiber networks

●The server control for the 64 servers is managed by the HMC (Hardware Management Console)

Dual bonded (2Gbps) FCM network. Separate network for Hardware Management.

13

14 14

Expansion Groups

Each expansion group (2 racks): 4 x P550 servers, 1 x DS5300, 18 x EXP5000 (6 + 12)

The system was built in phases (we'll talk about expansion later in the presentation).

Each set of 2 racks constitutes an “expansion group” of which there are 16 in total. (1 additional rack for network switches).

Each expansion group contains 18 EXP5000 drawers of storage, 1 DS5300 SAN controller and 4 P550 servers.

The expansion groups are linked through the FCM interconnect only.

14

15 15

Hardware Summary Full VLDB deployment: ● Smart Analytics like configuration ● 64 p550 Servers ● 16 DS5300 Storage Controllers ● 288 EXP5000 Disk Enclosures ● 4,608 Disks (450GB each -> 1.6PB) ● 8 IBM SAN switches (24p/40p) ● 7 Juniper Network switches (48p) ● 2 HMCs ● 6KM of copper cables ● 2KM of fiber cables ● Occupies 33 fully loaded racks ● Latest ‘Free cooling” designs are incorporated into the lab

15

16 16

Free Cooling ● 6 CRAC (Computer Room Air Con) units in the VLDB lab

● Ireland's favourable (?) climate

● Significant savings for Computer room cooling ● As long as outside air temp is below 9.5 degrees C, 100% of the cooling of the room is by fresh air ● Over a full year, 80% of the cooling is fresh air provisioned

16

17 17

Software Stack

DB2 (Server 9.7 Fix Pack 5)

IBM General Parallel File System (GPFS™ ) 3.3.0.14

IBM Tivoli System Automation for Multi-Platforms 3.1.1.3

IBM AIX 6.1 TL6 SP5

IBM DS Storage Manager 10.60.G5.16.

Testing commenced on DB2 9.7 FP1 and we continued testing through the levels. We are now testing the next version of DB2.

17

18 18

VLDB in the flesh

This picture shows two rows of racks of the VLDB system.

18

19 19

Agenda ● VLDB Mission ● What is a PetaByte? ● Building a PetaByte System ● Shared Nothing Architecture ● Database Design ● Data Generation & ETL ● Workload & Testing ● Performance Monitoring ● Expanding the System ● Useful Tips for VLDB ● Conclusion

Now that we've walked through the building of the VLDB infrastructure, we can turn our attention to the database layer.

First, let's do a quick refresh of the shared nothing paradigm...

19

20 20

Shared Nothing Architecture

(Diagram: a query "select … from table" is processed in parallel by DB2 engines over the Fast Communication Manager; each partition has its own tables, data and log: Partition 1, Partition 2, Partition 3 … Partition n)

Database ● Partitioned Database Model ● Database is divided into 504 partitions ● Partitions run on 63 physical nodes (8 partitions per host) ● Each Partition Server has dedicated resources ● Parallel Processing occurs on all partitions: coordinated by DB2 ● Single system image to user and application

Shared nothing means exactly that – no shared disk, no shared memory or processors. Not unique to DB2 – also used in Teradata, Netezza, Datallegro etc.

On VLDB we have Smart Analytics configuration which separates Data and Log onto separate volumes, one per partition.

20

21 21

Shared Nothing Architecture ● Hash Partitioning ● Provides the best parallelism and maximizes I/O capability ● VLDB management (recovery, maintenance, etc.) ● Large scans automatically run in parallel... ● All nodes work together ● Truly scalable performance ● 504 partitions will complete the job in 1/504th of the time ● Queries and Utilities too (backup/restore, load, index build etc)

Hash partitioning into buckets which are mapped to partitions using a distribution map. Partitioning key is crucial and is discussed later in this presentation.

21

22 22

Mapping DB2 Partitions to Servers

(Diagram: partitions part0 and part1 on Node 1, part2 and part3 on Node 2, communicating via FCM)

db2nodes.cfg
• DB2 instance configuration file (sqllib/db2nodes.cfg)
• All databases in the instance share this definition
• File stored in the DB2 instance sqllib directory and shared to other nodes via GPFS
• Specifies the host name or the IP address of the high speed interconnect for FCM communication

# sqllib/db2nodes.cfg
0 node1 0
1 node1 1
2 node2 0
3 node2 1
......

Db2nodes.cfg in the instance home sqllib directory maps partitions to their host servers. On VLDB there are 8 DB2 partitions (aka logical nodes) per server. The port range for the FCM is reserved for the instance in /etc/services on each host. When the database activates, an FCM conduit is allocated between each partition of the instance. With a large number of partitions this can consume a large number of ports (from the ephemeral port range).

22

23 23

Agenda ● VLDB Mission ● What is a PetaByte? ● Building a PetaByte System ● Shared Nothing Architecture ● Database Design ● Data Generation & ETL ● Workload & Testing ● Performance Monitoring ● Expanding the System ● Useful Tips for VLDB ● Conclusion

Now that we've covered the db2 instance configuration let's look at the approach to designing a Very Large Database...

23

24 24

Logical Design

● Star and snowflake schema
● Sampled production database artifacts
● Larger dimension tables are often snow-flaked
● Dimension tables have surrogate PKs
● Fact tables have composite PK or non-unique PK
● Dimension FKs are indexed
● VLDB: Relationships inferred
● All tables are compressed

Combination of star and snowflake.
Based on artifacts sampled from real-life data warehouses.
No referential integrity constraints – relationships inferred from indexing and joins (deviation from best practice).
Most dimension tables have surrogate primary keys (best practice).
Fact tables have a mixture of composite PK and non-unique PK (best practice).
Some larger dimension tables are snow-flaked to reduce redundant data and storage requirement.
All tables are compressed.

24

25 25

Sample Schema

● Star schema for 4 largest fact tables

25

26 26

Partition Group Design ● Partition Groups ● Small dimension tables in Single Database Partition Group (SDPG). ● Fact and large dimension tables are partitioned.

PG_DISJOINT1 PG_DISJOINT2

● Collocation of Facts and largest/frequently joined dimension. ● VLDB - disjoint partition groups used to drive FCM, Table Queues harder.

Partition Groups Small dimension tables (< 1 m rows) are placed on a single database partition. Fact and large dimension tables are partitioned in partition groups containing 503 partitions across 63 hosts. Facts and largest/most frequently joined dimensions are collocated.

On VLDB, we sometimes deliberately throw a “spanner in the works”... For example, we've included disjoint partition groups contrary to best practice design in order to drive FCM traffic (table queueing).

26

27 27

Choosing the Partitioning Key ● Partitioning key ● A subset of the primary key ● DISTRIBUTE BY HASH ● Fewer columns is better ● Surrogate key with high cardinality is ideal ● Collocation ● Possible for tables with same partitioning key ● Collocate Fact with largest commonly joined dimension table ● Consider adding redundant column to Fact PK ● Replicate other dimension tables ● Trade-off between partition balancing and optimal collocation

● Skew ● Avoid skew of more than 10% ● Avoid straggler partition.

Partitioning key: collocation is possible for tables with the same partitioning key; data types must effectively match; use table replication for other non-collocated dimensions.
Skew: aim for skew of less than 10%; the largest skew should be lower rather than larger than the average rowcount for the associated table in order to avoid an outlier partition.

27

28 28

Check Skew

-- rows per partition
SELECT dbpartitionnum(date_id) AS "Partition number",
       count(1)*10 AS "Total # records"
FROM bi_schema.tb_sales_fact TABLESAMPLE SYSTEM 10
GROUP BY dbpartitionnum(date_id)

Partition number    Total # records
----------------    ---------------
               1         10,313,750
               2         10,126,900
               3          9,984,910
               4         10,215,840

-- Space allocation per partition
SELECT DBPARTITIONNUM, SUM(DATA_OBJECT_L_SIZE) SIZE_KB
FROM SYSIBMADM.ADMINTABINFO
WHERE (tabschema, tabname) = ('THESCHEMA', 'THETABLE')
GROUP BY ROLLUP( DBPARTITIONNUM )
ORDER BY 2;

On VLDB we typically observe between 5 and 10% skew (for example, we see more skew on tables with a single column partitioning key with SMALLINT data-type). However, we have the ability to grow (scale-up – discussed later in presentation) tables at individual database partition(s) level in order to smooth out any skew and make 100% use of available diskspace, if required.

28

29 29

Physical Design ● Separate tablespaces for: ● Staging Tables ● Indexes ● MQTs ● Table data ● VLDB - typically larger tables have larger pagesize ● Range Partitioning ● Most Fact and large dimension tables are RP. Partitioned indexes. ● Facilitates roll-out/archiving, tablespace backups etc. ● > 200 ranges - consider MDC instead...

(Diagram: table partitions 2012Q1, 2011Q4, … 2006Q3 mapped to Table Space 14, Table Space 13, … Table Space 1 under Automatic Storage)

●Range Partitioning ● Most Fact tables and large dimension tables are RP ● Range partitioned by date. ● Data life-cycle management... ● Roll-in, Roll-out options... attach/add partition, partitioned indexes... ● Facilitates tablespace backup where active (hot) tablespaces are backed up more often than inactive (cold) tablespaces. ● VLDB employs less than 100 range partitions except for 4 very big fact tables (> Trillion rows) which have several hundred ranges.

29

30 30

MDC and MQT ● Multi-dimensional Clustering ● Why? ● Better query performance through data clustering ● Avoids the need for REORG ● How to choose the dimension columns? ● Columns in filter conditions ● Use combinations with many rows per distinct value

● Materialized Query Tables ● Why? ● Pre-compute costly aggregations ● Replicated tables for non-collocated dimension ● Layering of MQT for dimensional hierarchies ● How to implement successfully? ● Aim for > 10 fold reduction compared to fact table ● Keep base table and MQT stats up to-date ● Use range partitioning for the Fact & MQTs

MDC facilitates access to data by using multiple dimensions, therefore keeping data access only to selected relevant cells. Clustering by dimensions avoiding the need for reorganization of data (MDC is designed to keep data in order). Block based indexes on each dimension and dimension intersection can result in a significant saving over RID indexes especially during data-roll-in. Single MDC dimension of business day facilitates continuous data roll-in. Reduced locking too as locking is at block rather than row level. Aggregate functions to “coarsify” dimensions. Use monotonic function – INTEGER() versus MONTH() VLDB: Included MDC with varying number of cells: 100, 1000, 10000, 1000000 cells and MDC/RP combinations. MQTs incorporated to pre-compute costly aggregations and joins. REFRESH DEFERRED with and without staging tables. Replicated tables built on non-collocated dimension tables (technique described later in this presentation for refresh of large base table). Layering of MQT used for ROLAP drill-down hierarchies.

30

31 31

Agenda ● VLDB Mission ● What is a PetaByte? ● Building a PetaByte System ● Shared Nothing Architecture ● Database Design ● Data Generation & ETL ● Workload & Testing ● Performance Monitoring ● Expanding the System ● Useful Tips for VLDB ● Conclusion

Now that we've designed and created the database, let's take a look at how we generated, loaded and scaled-up (via ETL), the vast amount of data required to achieve a Petabyte milestone...

31

32 32

Intelligent Data Generation

● Workloads and schema ● 574 Tables ● 7,500 complex SQL statements ● Representative of a cross-section of real production data warehouses

● Synthetic data ● Referential integrity determined from SQL joins ● Valid result sets for the queries ● Data generated using prime sequences to prevent primary key collisions (patent filed)

We sampled 7,500 complex Select statements from production warehouses, representing a vast array of query constructs and resulting access plan operators. We needed a data generation tool that generates data to simultaneously return realistic data for each SELECT statement and also allow rapid data generation of trillions of rows in a matter of weeks rather than months or years. We developed new algorithms for generating and subsequently scaling up (via ETL) intelligent data, while maintaining referential integrity (preventing orphaned fact rows). A patent was filed in 2010 with the USPTO for these algorithms.

32

33 33

ETL ● Requirement ● Identify a high speed tool to scale-up the initial base data-set in parallel on each host in isolation. ● Avoid bottlenecks that could impede scalability e.g. network bandwidth. ● Ensure ETL scales-out linearly

Extract → Transform → LOAD → DB2

● Implementation on VLDB ● High Performance Unload → PIPE → LOAD ● HPU facilitates pseudo SQL for implementing key transformations ● Parallel feature “ON HOST” used, repartitioning TARGET KEYS.

So, now we have our algorithm for rapid scaleup of data (while preventing collisions and maintaining data integrity).... the next question is how to implement this?.. It boils down to a choice of three....

Ensure ETL scales-out linearly. What we require here is that each host scales-up a table at a constant speed so that ingest rate per host remains constant as the number of hosts increases.

33

34 34

Scale-up

● Data Generation and load of 1 petabyte of compressed intelligent data to VLDB in 31 days

● Linear out-scaling

(Chart: TeraBytes per day ingested vs. number of physical nodes, 4 to 64)

The chart may look contrived being so linear, but this is in reality exactly what we observe in terms of scale-up performance due to the collocated ETLs.

34

35 35

Ingesting Data ● LOAD ● Fast - Writes formatted DB2 pages directly into the database table with minimal logging. ● Table is not fully available – read access possible. ● Processing is done at the DB2 server machine. ● Specify the DATA BUFFER parameter. ● Import ● Slow - buffered/array partition level inserts are faster. ● DB2 v10 – Ingest ● Continuous, SQL arrays, tables fully available, client based. ● Lightweight ETL (SQL expressions including predicates & casting). ● Can be used against v9.5 and v9.7 databases

(Diagram: INGEST pipeline with Transport, Format and Flush phases)

LOAD Load is fast but does not facilitate full access to table. Import/Insert: Slow into large scale partitioned database. Partition level buffered/array inserts offers superior throughput. An alternative approach is LOAD NONRECOVERABLE into staging table and then use collocated INSERT- SELECT into Fact tables. Adjust commit size to tune ingest performance / row locking. DB2 v10 Ingest offers the best of both worlds – high throughput while keeping the tables fully available. Ingest also offers lightweight ETL via SQL expressions including basic predicates and casting. Client based and compatible with down-releases including 9.5 and 9.7 databases.

35

36 36

The Big Four

● 574 Tables in total ● > 90% of the total raw data is contained in 4 large fact tables ● The four big fact tables and their associated dimensions:

36

37

Total Raw Data equivalent

The compression factor here does not include indexes and therefore the ratio is understated.

38 38

Agenda ● VLDB Mission ● What is a PetaByte? ● Building a PetaByte System ● Shared Nothing Architecture ● Database Design ● Data Generation & ETL ● Workload & Testing ● Performance Monitoring ● Expanding the System ● Useful Tips for VLDB ● Conclusion

Now that we've got the database populated we can move on to running the workload and testing the system...

38

39 39 Conducting Workload

● Utilities ● High Concurrency ● Query ● WLM ● IUD ● Monitoring ● Warehouse Features ● ETL ● BAR

Essentially the workload consists of the full data life-cycle for a typical Very Large Data Warehouse.

39

40 40

What else is being tested?

● High Availability ● Scalability ● Limit Conditions ● Admin. Functions ● System Expansion & Redistribution ● Execution Plan Stability ● 128, 504 and 1000 partitions ● Problem Determination ● Optim ● Abnormal Termination ● Access Plan Stability

Realistic scenarios start with DB2 Best Practice and then widen to examine PMR/APARs.

Limit conditions includes maximum number of ranges 32767, max-number of database partitions 1000, huge – trillion row tables etc. etc.

40

41 41

Agenda ● VLDB Mission ● What is a PetaByte? ● Building a PetaByte System ● Shared Nothing Architecture ● Database Design ● Data Generation & ETL ● Workload & Testing ● Performance Monitoring ● Expanding the System ● Useful Tips for VLDB ● Conclusion

Next, let's talk about performance monitoring. Monitoring the performance of a VLDB is important....

41

42 42

DB2 Performance Monitoring

Monitoring table functions ● Lower overhead than older snapshot based functions ● MON_GET_UNIT_OF_WORK – monitor long running queries ● MON_GET_CONNECTION – aggregated measures for connected applications. Useful for checking locks, FCM usage ● On VLDB use the MON_GET_MEMORY_POOL function to track instance memory (FCMBP, Utility, Bufferpool, Sortheap) ● MON_GET_TABLE useful for identifying TABLE_SCANS

Db2top ● Quick interactive view ● Obtain data for single partition db2top -P

Optim Performance Manager (OPM) ● Sophisticated graphical browser based monitoring ● Facilitates “in-flight” and retrospective analysis ● Leverages monitoring table functions

In addition to OS level system monitoring metrics gathered through svmon, vmstat, netstat, nmon etc., it is important to monitor DB2 performance. The following are some of the approaches adopted on VLDB.....

V9.7 onwards, monitoring table functions. Lower overhead than older snapshot based functions. Facilitates convenient recording of monitoring results in tables via SQL insert, for retrospective analysis.

Use -2 for second parameter to obtain metrics for all partitions.

Even though lower overhead than snapshot functions, on a VLDB there can still be a significant amount of data sent back to the coordinator (e.g. MON_GET_CONNECTION with large number of connections and database partitions) so bear this in mind in particular if incorporating calls to monitoring functions in applications or calling several in parallel. .

42

43 43

Agenda ● VLDB Mission ● What is a PetaByte? ● Building a PetaByte System ● Shared Nothing Architecture ● Database Design ● Data Generation & ETL ● Workload & Testing ● Performance Monitoring ● Expanding the System ● Useful Tips for VLDB ● Conclusion

Sooner or later a successful data warehouse will require additional capacity. Indeed, the VLDB system was expanded in phases...

43

44 44

Expanding the System ● Add additional nodes/hosts to provide additional capacity. ● Mixed generation systems possible ● Must support same OS and s/w level. ● Plan ahead ● Plan to expand when capacity has exceeded 80% ● Use the REDISTRIBUTE command ● REDISTRIBUTE DATABASE PARTITION GROUP PDPG UNIFORM NOT ROLLFORWARD RECOVERABLE DATA BUFFER 300000 PRECHECK ONLY ● Ensure enough disk space to rebuild indexes on the largest table. ● Consider using INDEXING MODE DEFERRED ● Extensive testing on VLDB ● System expanded in phases. ● System reduction also included, as migration of systems to a more powerful next generation platform may require fewer host machines.

Good capacity planning practices can assist in early detection of trends in resource usage. Best practice is to create and document a performance baseline for each workload and a forecast baseline for the next 12 months. Add a data module to expand storage capacity and reduce the data volume per database partition. Or, you can add a user module to increase the capacity of the system to accommodate users. Review your backup and recovery infrastructure to ensure that you can maintain service level objectives after the expansion ETL applications and maintenance scripts will need to be reviewed to accommodate the expanded system, for example Datastage db2 connector.

Must support same OS Level. E.g., add 7600 R2, 7700 modules to Smart Analytics 7600R1 cluster. Additional partitions created on next generation module to take advantage of advancement in processing power.

44

45 45

Agenda ● VLDB Mission ● What is a PetaByte? ● Building a PetaByte System ● Shared Nothing Architecture ● Database Design ● Data Generation & ETL ● Workload & Testing ● Performance Monitoring ● Expanding the System ● Useful Tips for VLDB ● Conclusion

45

46 46

VLDB Tips ● Consider configuring AUTO-RUNSTATS ● Ensures statistics are up to date ● Create statistics profile. RUNSTATS SET PROFILE ● Include STATISTICS USE PROFILE with LOAD ● Collect distribution statistics in columns where there is skew and predicates use constants ● Collect column group statistics with multiple predicates on the same table e.g. WHERE country = 'IRELAND' AND city = 'DUBLIN' ● Use sampling for very large tables ● Runstats on table scm.tab on key columns with distribution on key columns tablesample system(1) ● Do not configure AUTO-REORG ● When tables contain billions or trillions of rows, it is important to control the maintenance aspect of REORG ● Use MDC to reduce requirement to reorganize large tables.

As well as collecting distribution statistics on columns which exhibit skew, it may be appropriate to increase the number of quantile statistics for range predicates on datetime (e.g. those containing sentinels e.g. 31/12/9999) and character string columns.

If you want to reduce runtime for RUNSTATS on a big table, potentially losing some precision, use sampling. The table sample can be made even smaller for big tables. Runtime can be reduced further by collecting statistics on a subset of columns.

46

47 47

VLDB Tips

● Use ssh for instance remote shell ● DB2SET DB2RSHCMD=/bin/ssh ● Particularly important when more than 200 partitions (rsh limit) ● Use multiple coordinators ● Spread client connections across partitions ● Prevents over-committing resources on any one host e.g. application heap, CPU. ● Use connection concentrator for large number of applications ● Use MAX_CONNECTIONS > MAX_COORDAGENTS (fixed) ● Use explicit activation ● db2 activate db myDB ● Use split diagnostics directories ● Avoid contention on single diagnostics log ● Use db2diag -global -merge to merge.

Use connection concentrator to reduce load (&context switching) on coordinator.

47

48 48

VLDB Tips ● Compression ● Enable compression if system is I/O bound (IO Waits) ● Do not enable compression if the system is CPU bound. ● Estimate compression ratios using the administration function ADMIN_GET_TAB_COMPRESS_INFO_V97 ● Backup compression – don't double up ● DB2 V10 Adaptive Compression ● Understand instance memory allocation ● Avoid over-committing memory – can lead to paging ● Particularly important with High Availability ● Monitor with db2mtrk -i -d -v ● Can control with INSTANCE_MEMORY and DATABASE_MEMORY – not always appropriate in partitioned environment ● Materialized Query Tables (MQT) ● For large replicated tables on VLDB use a partitioned MQT to distribute the table replication across all partitions....

Understand the variables that affect DB2 memory allocation. For example, in a VLDB, FCM channel and buffer allocation is proportional to the total number of partitions in the instance.

DB2 should use all available system physical memory but not more. Avoid paging, in particular acute sudden paging which could cause agents to be delayed/hung.

48

49 49

Refresh Large Replicated MQT

● Base Table is 1 GB = 10m rows (108 byte rowsize) ● Admin NIC can handle 250MB/s ● REFRESH operation completes in 66 secs

Sending 16 X 1GB in total @ 250MB/s (125MB/s to each of the two other hosts)

Note: durations calculated here are based purely on physical network constraints and assume other resources are unlimited – actual duration will depend on other system specific variables including disk I/O, CPU clock speed etc.

(Diagram: the BASE table on Host 1 streams to the replicated copies (REPL) on every partition of Host 2 and Host 3)

49

50 50

Expand System

● Base Table is 1 GB = 10m rows (108 byte rowsize) ● Admin NIC can handle 250MB/s ● REFRESH operation completes in 11 mins

Sending 160 X 1GB in total @ 250MB/s (12.5MB/s from Host 1 to each of the 20 data hosts)

(Diagram: the BASE table on Host 1 streams to the replicated copies (REPL) on every partition of Host 2 through Host 21)

50

51 51

Introduce a Distributed MQT
● Now REFRESH of the replicated table is re-written as a query on the distributed MQT instead of the base table on the catalog partition.
● CREATE TABLE distrib_mqt AS (SELECT * FROM basetab) DATA INITIALLY DEFERRED REFRESH IMMEDIATE ENABLE QUERY OPTIMIZATION DISTRIBUTE BY HASH(col1) IN TS_BIG
● Each partition now sends approximately 1/160th of the table to each of the other partitions.
● So approximately 1024MB * 8 * 19/20 to be transmitted & received by each host
● REFRESH operation now completes in approximately 1 min.

(Diagram: each partition's slice of the distributed MQT (1/160th) feeds the replicated copies (REPL) on Host 2 through Host 21)

51

52 52

Agenda ● VLDB Mission ● What is a PetaByte? ● Building a PetaByte System ● Shared Nothing Architecture ● Database Design ● Data Generation & ETL ● Workload & Testing ● Performance Monitoring ● Expanding the System ● Useful Tips for VLDB ● Conclusion

Sooner or later a successful data warehouse will require additional capacity. Indeed, the VLDB system was expanded in phases...

52


Conclusion

VLDB works great on DB2!

DB2 is designed and architected to grow and perform from a single partition to 1,000 partitions, providing plenty of headroom as your business needs grow.

53

Austin Clifford IBM [email protected] Session VLDB - An Analysis of DB2 at Very Large Scale - D13

54


Backup Slides

55

Scaleup Implementation

[Figure: the scaleup ETL runs entirely locally on each data server – no traffic between servers.]
● DataServer1 – logical nodes 1, 2, 3: Load from pipe (pipe.001, pipe.002, pipe.003) .. partitioned db config mode outputdbnums(1,2,3), loading the data back into the local containers.
● DataServer2 – logical nodes 4, 5, 6: Load from pipe (pipe.004, pipe.005, pipe.006) .. partitioned db config mode outputdbnums(4,5,6), loading the data back into the local containers.
● On each server, db2hpu -i instance -f VLDBcontrolfile unloads data from the containers for the local nodes, updates the key columns, and passes the data through the pipes to LOAD.

This diagram depicts the selected implementation of the scaleup ETL process, using High Performance Unload to extract the data, which is then passed via a named (FIFO) pipe directly to the DB2 LOAD utility for each database partition in parallel. Because the scaleup algorithm does not change the distribution key, the ETL is collocated on each server, which makes the scaleup process extremely rapid.
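A minimal sketch of the pipe-and-load plumbing on one data server (paths, the table name, the LOAD mode and the .00n file-naming convention are assumptions; the HPU control file contents are not shown):

  # one named pipe per local database partition
  mkfifo /work/pipe.001 /work/pipe.002 /work/pipe.003

  # LOAD reads the already-partitioned rows from the pipes; MODE LOAD_ONLY is an
  # assumption here -- no rows need to be shipped to other servers, so the load stays local
  db2 "LOAD FROM pipe OF DEL REPLACE INTO vldb.fact
       PARTITIONED DB CONFIG MODE LOAD_ONLY
       PART_FILE_LOCATION /work
       OUTPUT_DBPARTNUMS (1,2,3)" &

  # High Performance Unload extracts from the local containers, updates the key
  # columns and streams the rows into the pipes
  db2hpu -i instance -f VLDBcontrolfile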

56


Prime Sequences

● Prevent key collisions ● Duplicates are very costly during load ● Avoiding PK collisions is essential ● Nested sequences are unique, but result in skewed values ● => use cycling sequences

[Diagram: nested sequences vs. cycling sequences]

Prevent key collisions. Eliminating duplicates during the index build phase is very costly compared to the bulk load itself: it can be an order of magnitude slower to delete a small percentage of duplicate rows than to load the entire range directly to the containers. This is particularly acute at VLDB scale, where tens of billions of rows are loaded into a single table range, so avoiding PK collisions is essential for high-speed data population. Nested sequences do guarantee uniqueness, but result in a skewed distribution of values and very sparse fact tables – hence the use of cycling sequences.

57


Prime Sequences

● Problem ● Cycling sequences can hit a collision before the full Cartesian product is reached if the constituent column cardinalities share a common factor...

● Solution ● Use sequences with prime cardinality....

Collisions are much more likely to occur when generating huge datasets. They are caused by cycling sequences that share a common factor. We need a simple, efficient way to prevent them – one that uses simple arithmetic operators (for performance) and does not require counters or similar bookkeeping to track previously used combinations.

58


Prime Sequences

● Easy algorithm with no need for counters etc ● Just need the ranges for the columns and the row number to determine the key values

MOD(N - 1, R) + 1, where N = Row Number and R = Range (Cardinality)

● Example: ● Col1 has a range of 2 values ● Col2 has a range of 3 values ● Col3 has a range of 5 values ● Full cartesian product would contain 30 rows

So, what if we use prime sequences?

Distinct prime numbers, by definition, do not share a common factor, which guarantees that the full Cartesian product of the sequences is reached before any key combination repeats – i.e. no collisions.

Furthermore, this simple formula is all that is needed to calculate the value of a key column for a given row. The formula also lends itself to partitioning the generation, i.e. one range of rows can be generated independently of another. This is important, as parallelism is essential to achieving the throughput required to generate billions of rows quickly.
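A small sketch of the idea in SQL (the recursive common table expression and the column ranges are illustrative): it generates row numbers 1–30 and derives the three key columns with MOD(N - 1, R) + 1; because the ranges 2, 3 and 5 are distinct primes, all 30 key combinations are unique.

  WITH gen(n) AS (
    SELECT 1 FROM SYSIBM.SYSDUMMY1
    UNION ALL
    SELECT n + 1 FROM gen WHERE n < 30
  )
  SELECT n,
         MOD(n - 1, 2) + 1 AS col1,   -- range (cardinality) 2
         MOD(n - 1, 3) + 1 AS col2,   -- range (cardinality) 3
         MOD(n - 1, 5) + 1 AS col3    -- range (cardinality) 5
  FROM gen
  ORDER BY n;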

59


Prime Sequences

● Easy algorithm with no need for counters etc. ● Just need the ranges for the columns and the row number to determine the key values ● Col1: MOD(22 - 1, 2) + 1 = 2, where 22 = Row Number and 2 = Range (Cardinality)

● Example: ● Col1 has a range of 2 values ● Col2 has a range of 3 values ● Col3 has a range of 5 values ● Full cartesian product would contain 30 rows

And the same calculation for the 2nd key column...

60


Prime Sequences

Unique Primes

Next, to make sure that the joins (reverse-engineered from the SELECT statements) work, we need to propagate the same prime cardinality to all related columns.

We also need to ensure that the prime cardinality is unique among all the tables it is propagated to. This check is performed with a recursive SELECT using a common table expression (WITH clause); a sketch follows.
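A hedged sketch of such a check, assuming a hypothetical GEN.PRIME_ASSIGNMENTS metadata table that records the prime cardinality already assigned to each column group: a recursive CTE generates candidates, non-primes are filtered out, and the smallest prime not yet assigned is proposed.

  WITH candidates(n) AS (
    SELECT 2 FROM SYSIBM.SYSDUMMY1
    UNION ALL
    SELECT n + 1 FROM candidates WHERE n < 1000   -- upper bound is illustrative
  ),
  primes(p) AS (
    SELECT c.n FROM candidates c
    WHERE NOT EXISTS (SELECT 1 FROM candidates d
                      WHERE d.n < c.n AND MOD(c.n, d.n) = 0)
  )
  SELECT MIN(p) AS next_free_prime
  FROM primes
  WHERE p NOT IN (SELECT prime_cardinality
                  FROM gen.prime_assignments);    -- hypothetical metadata table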

61


Scaleup Fact Table

● Generate a base set of data and then "Scale Up" the rest
● Transpose an existing piece of data into a new piece of data for the scaleup
● Facts and Dimensions
  ● Facts are range partitioned into 100 parts (see the DDL sketch after the notes below)
  ● Populate part 0 for each and then scale up to fill the remaining 99

[Figure: PART 0 of the range-partitioned fact table is populated directly, then scaled up to fill PART 1, PART 2, PART 3, ...]

Even when using the prime cardinality algorithm and partitioning the generation across multiple parallel threads, the throughput is still not enough: it is governed by the time taken to generate the non-key columns, which are (seeded) randomly generated numbers, strings, etc., depending on the data type. Therefore, rather than generating all non-key values from first principles, we scale up. Scaleup, as described, avoids most of the CPU-intensive random value generation and more closely approaches pure disk I/O speed – i.e. it is much faster.
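A hedged DDL sketch of the kind of fact table this implies (the column names, tablespace and clause ordering are illustrative, not the project's actual DDL):

  CREATE TABLE vldb.fact (
    part_id    SMALLINT NOT NULL,    -- scaleup range: 0..99
    key1       INTEGER  NOT NULL,
    key2       INTEGER  NOT NULL,
    key3       INTEGER  NOT NULL,
    payload1   VARCHAR(20),
    payload2   VARCHAR(20)
  )
  IN ts_fact
  DISTRIBUTE BY HASH (key1)                       -- spread rows across all database partitions
  PARTITION BY RANGE (part_id)                    -- 100 table ranges: part 0 is generated, 1-99 are scaled up
    (STARTING FROM (0) ENDING (99) EVERY (1));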

62


Scaleup

[Figure: sample rows – three key columns plus two random string columns – are extracted from the existing range and their keys transposed with MOD((L + V - 1), R) + 1 to produce the rows for the next range.]

The scaleup algorithm that transposes the sequential keys into the subsequent range is a close variation of the one used to generate them: since the key for row N is MOD(N - 1, R) + 1, applying MOD((L + V - 1), R) + 1 to an existing key value L is equivalent to generating the key for row N + V, where V is the row-number offset of the new range.
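A hedged SQL sketch of one scaleup step, reusing the hypothetical vldb.fact table from the earlier sketch (in the real implementation the extract and load run through HPU and named pipes rather than an INSERT ... SELECT):

  -- transpose PART 0 into the next range: key columns are shifted by the offset V
  -- (V = 5 here, the number of rows in the example base part); payload columns are reused
  INSERT INTO vldb.fact (part_id, key1, key2, key3, payload1, payload2)
  SELECT 1,                              -- target range
         MOD(key1 + 5 - 1, 2) + 1,       -- range 2
         MOD(key2 + 5 - 1, 3) + 1,       -- range 3
         MOD(key3 + 5 - 1, 5) + 1,       -- range 5
         payload1,
         payload2
  FROM vldb.fact
  WHERE part_id = 0;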

63


Scaleup

[Figure: the same transposition, now showing both the Extract step from the existing range and the Load step into the new range.]

We'll look at the exact implementation of the Extract and Load in the coming slides....

64


Scaleup

[Figure: the Extract / transpose / Load flow again, with the transposed key values shown alongside the originals.]

As you can see, the key values calculated using the scaleup formula are exactly the same as those that the original row-number formula would have generated.

Importantly, the scaleup algorithm can also be partitioned...

65


Performance Monitoring

● System performance
  ● CPU: vmstat, nmon
    ● System CPU should be <= 25% of user CPU
    ● I/O waits < 25%
    ● Run queue is more representative (see notes below)
  ● I/O
    ● Disk: iostat
    ● Network (FCM): netstat, entstat
  ● Memory: svmon, vmstat
(Example commands follow the notes below.)

Automating the collection of performance metrics is important to understand the system behavior under normal, peak and offline workload. Establish a baseline and retain these metrics for planning system expansion and performance tuning.

Run queue: in a clustered system like the IBM Smart Analytics System, with multiple servers and multi-core, multithreaded CPUs, first check at what level the monitoring tool calculates 100%: it might be the maximum utilization of all threads, or it might be scaled down to thread level. In the latter case a fully utilized 16-thread system would show a utilization of 1600%. To get an impression of the load on a cluster, the length of the run queue is a good measure: it gives the administrator a good indication of how busy the system is in terms of parallel running jobs.

netstat: check KB/s on all hosts and that transmission errors are < 1% – netstat -i
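A hedged sketch of the corresponding commands (intervals and the adapter name are illustrative; entstat and svmon assume AIX hosts, as on the Power-based IBM Smart Analytics System):

  vmstat 5             # CPU user/system split, I/O wait and run queue (r column)
  iostat 5             # per-disk throughput and %busy
  netstat -i           # per-interface packet counts and errors on every host
  entstat -d en0       # detailed Ethernet adapter statistics for the FCM network (AIX)
  svmon -G             # global memory and paging-space usage (AIX)
  db2mtrk -i -d -v     # DB2 instance and database memory breakdown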

66