
VLDB - An Analysis of DB2 at Very Large Scale

Austin Clifford IBM DRAFT Session Code: 2130 Fri, May 18, 2012 (08:00 AM - 09:00 AM) | Platform: DB2 for LUW - II

Presentation Objectives

1) Design & implementation of a VLDB. 2) Benefits and best practice use of DB2 Warehouse features. 3) Ingesting data into VLDB. 4) Approach & considerations to scaling out VLDB as the system grows. 5) Management and problem diagnosis of a VLDB.

Disclaimer

●© Copyright IBM Corporation 2012. All rights reserved. ●U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

●THE INFORMATION CONTAINED IN THIS PRESENTATION IS PROVIDED FOR INFORMATIONAL PURPOSES ONLY. WHILE EFFORTS WERE MADE TO VERIFY THE COMPLETENESS AND ACCURACY OF THE INFORMATION CONTAINED IN THIS PRESENTATION, IT IS PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED. IN ADDITION, THIS INFORMATION IS BASED ON IBM’S CURRENT PRODUCT PLANS AND STRATEGY, WHICH ARE SUBJECT TO CHANGE BY IBM WITHOUT NOTICE. IBM SHALL NOT BE RESPONSIBLE FOR ANY DAMAGES ARISING OUT OF THE USE OF, OR OTHERWISE RELATED TO, THIS PRESENTATION OR ANY OTHER DOCUMENTATION. NOTHING CONTAINED IN THIS PRESENTATION IS INTENDED TO, NOR SHALL HAVE THE EFFECT OF, CREATING ANY WARRANTIES OR REPRESENTATIONS FROM IBM (OR ITS SUPPLIERS OR LICENSORS), OR ALTERING THE TERMS AND CONDITIONS OF ANY AGREEMENT OR LICENSE GOVERNING THE USE OF IBM PRODUCTS AND/OR SOFTWARE.

•IBM, the IBM logo, ibm.com, and DB2 are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml 4

What is a Very Large Database?

A very large database, or VLDB, is a database that contains an extremely high number of tuples (database rows), or occupies an extremely large physical filesystem storage space. The most common definition of VLDB is a database that occupies more than 1 terabyte. 5

Agenda ● VLDB Mission ● What is a PetaByte? ● Building a PetaByte System ● Shared Nothing Architecture ● Database Design ● Data Generation & ETL ● Workload & Testing ● Performance Monitoring ● Expanding the System ● Useful Tips for VLDB 6

VLDB Mission

● Real-time analytics is placing increasing demands on data warehouse systems. ● Verify the performance and scalability of DB2 and its complementary products at the Petabyte scale. ● Simulate heavy on-peak analytics in parallel with other essential system functions such as data ingest and backup and recovery. ● Guide best practices and future product direction. ● Develop techniques for massive scale rapid data generation. 7

Digital Data 101 – What is a Petabyte?

● 1 Bit = Binary Digit ● 8 Bits = 1 Byte ● 1024 Bytes = 1 Kilobyte ● 1024 Kilobytes = 1 Megabyte ● 1024 Megabytes = 1 Gigabyte ● 1024 Gigabytes = 1 Terabyte ● 1024 Terabytes = 1 Petabyte ● 1024 Petabytes = 1 Exabyte ● 1024 Exabytes = 1 Zettabyte ● 1024 Zettabytes = 1 Yottabyte ● 1024 Yottabytes = 1 Brontobyte ● 1024 Brontobytes = 1 Geopbyte 8

Agenda ● VLDB Mission ● What is a PetaByte? ● Building a PetaByte System ● Shared Nothing Architecture ● Database Design ● Data Generation & ETL ● Workload & Testing ● Performance Monitoring ● Expanding the System ● Useful Tips for VLDB 9

The Building Blocks We Start with the Storage:

1x = 450GB

1PB of DB Data = Raw Data + RAID + Contingency = 1.6PB

4,608 x = 1.6PB 10

The Building Blocks ● Disks get housed in EXP5000 enclosures ● EXP5000 can hold 16 disks

4608/16 = x 288

● EXP5000 need a DS5300 Storage controller to manage the IO activity (1 DS for 18 EXP)

x 288 = x 16 11

The Building Blocks

●That's the storage done – now we need to drive the system with servers. ●To maximise the advantages of parallel processing, the 16 Storage controllers & disks are assigned to 1 cluster each, with a Smart Analytics guideline of 4 p550 Servers per cluster (64 servers total)

= 4 x 12

The Building Blocks ●The communication between devices takes place via Juniper Network switches for the copper networks and IBM SAN switches for the fiber networks

●The server control for the 64 servers is managed by the HMC (Hardware Management Console) 13

Hardware Summary

● Full VLDB deployment: ● Smart Analytics like configuration ● 64 p550 Servers ● 16 DS5300 Storage Controllers ● 288 EXP5000 Disk Enclosures ● 4,608 Disks (450GB each -> 1.6PB) ● 8 IBM SAN switches (24p/40p) ● 7 Juniper Network switches (48p) ● 2 HMCs ● 6KM of copper cables ● 2KM of fiber cables ● Occupies 33 fully loaded racks ● Latest 'Free cooling' designs are incorporated into the lab ● Resulting in a predicted saving of 60% of the power required for cooling 14

Where is the system housed?

● The VLDB deployment when racked up, occupies 33 fully populated racks ● At project inception, there was no lab on the Dublin campus that could house the power and cooling requirements ● A brand new lab was built ● Each device and Rack for the VLDB system was delivered individually in its own packaging and had to be unpacked and racked ● Packaging should not be underestimated!! ● The VLDB project filled 7 industrial dumpsters with packaging. 15

Free Cooling

● There are 6 CRAC (Computer Room Air Con) units in the IM Lab ● Ireland's favourable (?) climate results in significant savings for Computer room cooling ● As long as the outside air temp is below 9.5 degrees C, 100% of the cooling of the room is by fresh air ● Over the full year, 80% of the cooling will be fresh air provisioned 16

Expansion Groups 17

Software Stack

● The following software was installed on the system: ● DB2 (Server 9.7 Fix Pack 5) ● IBM AIX 6.1 TL6 SP5 ● IBM General Parallel File System (GPFS™ ) 3.3.0.14 ● IBM Tivoli System Automation for Multi-Platforms 3.1.1.3 ● IBM DS Storage Manager 10.60.G5.16. 18

Agenda ● VLDB Mission ● What is a PetaByte? ● Building a PetaByte System ● Shared Nothing Architecture ● Database Design ● Data Generation & ETL ● Workload & Testing ● Performance Monitoring ● Expanding the System ● Useful Tips for VLDB 19

Shared Nothing Architecture

select … from table

Tables

[Diagram: Fast Communication Manager connecting Engine + data+log on Partition 1, Partition 2, Partition 3, … Partition n]

Database ● Partitioned Database Model ● Database is divided into 504 partitions ● Partitions run on 63 physical nodes (8 partitions per host) ● Each Partition Server has dedicated resources ● Parallel Processing occurs on all partitions: coordinated by the DBMS ● Single system image to user and application 20

Shared Nothing Architecture ● Hash Partitioning ● Provides the best parallelism and maximizes I/O capability ● VLDB management (recovery, maintenance, etc.) ● Large scans automatically run in parallel... ● All nodes work together ● Truly scalable performance ● As we have 504 partitions, then it should finish in 1/504th of the time ● And not just the queries, but the utilities too (backup/restore, load, index build etc) 21
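
A minimal sketch of the point about utilities (the database name and backup path are hypothetical; ON ALL DBPARTITIONNUMS is standard partitioned-database syntax): one command drives the backup on every partition in parallel.

db2 "BACKUP DATABASE myDB ON ALL DBPARTITIONNUMS TO /backup COMPRESS"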

Mapping DB2 Partitions to Servers

[Diagram: Node 1 hosts part0 and part1, Node 2 hosts part2 and part3; the partitions communicate through the FCM]

# db2nodes.cfg
0 node1 0
1 node1 1
2 node2 0
3 node2 1

•DB2 instance configuration file sqllib/db2nodes.cfg
•All databases in the instance share this definition
•File in the DB2 instance directory
•Sqllib directory located on one node of the system
•GPFS/NFS mounted by all other nodes 22

Agenda ● VLDB Mission ● What is a PetaByte? ● Building a PetaByte System ● Shared Nothing Architecture ● Database Design ● Data Generation & ETL ● Workload & Testing ● Performance Monitoring ● Expanding the System ● Useful Tips for VLDB 23

Database Design

● Star and snowflake ● Sampled production database artifacts. ● Dimensional levels and hierarchies. ● Larger dimension tables are typically snow-flaked. ● No referential integrity – relationships inferred. ● Dimension tables have surrogate PKs ● Fact tables - composite PK or non-unique PK. ● Dimension FKs are indexed. ● All tables are compressed. 24

Database Design

● Star schema for 4 largest fact tables 25

Database Design

● Partition Groups ● Small dimension tables in SDPG. ● Fact and large dimension tables are partitioned. ● Collocation of Facts and largest/frequently joined dimension. ● Disjoint partition groups to drive table queueing. 26
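
A sketch of the partition-group layout described above (group, partition and tablespace names are illustrative): small dimensions go to a single-partition group, facts and large dimensions to a group spanning the data partitions.

CREATE DATABASE PARTITION GROUP sdpg ON DBPARTITIONNUM (1);
CREATE DATABASE PARTITION GROUP pdpg ON DBPARTITIONNUMS (1 TO 503);
CREATE TABLESPACE ts_dim_small IN DATABASE PARTITION GROUP sdpg;
CREATE TABLESPACE ts_fact_data IN DATABASE PARTITION GROUP pdpg;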

Database Design

● Partitioning key ● A subset of the primary key ● DISTRIBUTE BY HASH ● Fewer columns is better ● Surrogate key with high cardinality is ideal ● Collocation ● Possible for tables with same partitioning key ● Data type must match ● Collocate Fact with largest commonly joined dimension table ● Use table replication for other non-collocated dimensions. ● Trade-off between partition balancing and optimal collocation ● Skew ● Aim for skew of less than 10% ● Avoid straggler partition. 27
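
A sketch of collocation (table, column and tablespace names are illustrative): the fact table and its largest, most frequently joined dimension share the same distribution key, with matching data types, and live in the same partition group, so the join needs no data movement between partitions.

CREATE TABLE bi_schema.tb_date_dim (
   date_id   INTEGER NOT NULL PRIMARY KEY,
   cal_date  DATE
)
IN ts_fact_data
DISTRIBUTE BY HASH (date_id);

CREATE TABLE bi_schema.tb_sales_fact (
   date_id   INTEGER NOT NULL,
   store_id  INTEGER NOT NULL,
   amount    DECIMAL(15,2)
)
IN ts_fact_data
DISTRIBUTE BY HASH (date_id)
COMPRESS YES;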

Check Skew

-- rows per partition
SELECT dbpartitionnum(date_id) AS "Partition number",
       count(1)*10 AS "Total # records"
FROM bi_schema.tb_sales_fact TABLESAMPLE SYSTEM (10)
GROUP BY dbpartitionnum(date_id)

Partition number    Total # records
1                   10,313,750
2                   10,126,900
3                    9,984,910
4                   10,215,840

-- Space allocation per partition
SELECT dbpartitionnum, SUM(data_object_l_size) AS size_kb
FROM sysibmadm.admintabinfo
WHERE tabschema = 'THESCHEMA' AND tabname = 'THETABLE'
GROUP BY ROLLUP( dbpartitionnum )
ORDER BY 2;
28

Database Design

● Separate tablespaces for: ● Staging Tables ● Indexes ● MQTs ● Table data ● Individual data partitions in large range partitioned tables ● Page Size ● On VLDB, tablespaces of all page sizes are included (4K, 8K, 16K, 32K). ● Typically larger tables have a larger page size. ● Range Partitioning ● Most Fact tables and large dimension tables are RP ● Range partitioned by date interval. ● Less than 100 ranges ideal. ● Partitioned (local) indexes. 29
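
A sketch of range partitioning with dedicated tablespaces and a partitioned index (all object names, and the 32K page size choice, are illustrative):

CREATE BUFFERPOOL bp32k PAGESIZE 32K;
CREATE TABLESPACE ts_fact_2012m01 PAGESIZE 32K BUFFERPOOL bp32k;
CREATE TABLESPACE ts_fact_2012m02 PAGESIZE 32K BUFFERPOOL bp32k;

CREATE TABLE bi_schema.tb_txn_fact (
   txn_date   DATE    NOT NULL,
   account_id INTEGER NOT NULL,
   amount     DECIMAL(15,2)
)
DISTRIBUTE BY HASH (account_id)
PARTITION BY RANGE (txn_date)
  (PARTITION p2012m01 STARTING '2012-01-01' ENDING '2012-01-31' IN ts_fact_2012m01,
   PARTITION p2012m02 STARTING '2012-02-01' ENDING '2012-02-29' IN ts_fact_2012m02);

CREATE INDEX bi_schema.ix_txn_acct ON bi_schema.tb_txn_fact (account_id) PARTITIONED;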

Database Design

● Multi-dimensional Clustering ● MDC with various number of cells ● Performance, less REORG. ● “Coarsify” dimensions. ● Monotonic functions. ● MDC and RP combination ● Careful with the resulting number of cells... ● Materialized Query Tables ● Pre-compute costly aggregations and joins. ● REFRESH DEFERRED. ● Replicated tables for non-collocated dimension. ● Layering of MQT. 30
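
Two sketches of the features above (all object names are illustrative). First, an MDC fact table clustered on a coarsified, monotonic month dimension; second, a replicated, deferred-refresh MQT over a small non-collocated dimension.

CREATE TABLE bi_schema.tb_txn_fact_mdc (
   txn_date   DATE    NOT NULL,
   region_id  INTEGER NOT NULL,
   account_id INTEGER NOT NULL,
   amount     DECIMAL(15,2),
   txn_month  INTEGER GENERATED ALWAYS AS (INTEGER(txn_date)/100)
)
DISTRIBUTE BY HASH (account_id)
ORGANIZE BY DIMENSIONS (txn_month, region_id);

CREATE TABLE bi_schema.tb_store_dim_rep AS
   (SELECT store_id, store_name, region_id FROM bi_schema.tb_store_dim)
   DATA INITIALLY DEFERRED REFRESH DEFERRED
   IN ts_fact_data
   REPLICATED;
REFRESH TABLE bi_schema.tb_store_dim_rep;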

Agenda ● VLDB Mission ● What is a PetaByte? ● Building a PetaByte System ● Shared Nothing Architecture ● Database Design ● Data Generation & ETL ● Workload & Testing ● Performance Monitoring ● Expanding the System ● Useful Tips for VLDB 31

Intelligent Data Generation

● Workloads and schema. ● 574 Tables ● 7,500 complex SQL statements ● Representative of a cross-section of real production data warehouses
[Diagram: Data Generator]

● Synthetic data ● Referential integrity determined from SQL joins ● Valid result sets for the queries ● Data generated using prime sequences to prevent primary key collisions (patent pending) 32

Prime Sequences

● Prevent key collisions ● Duplicates are very costly during load. ● Avoiding PK collisions essential. ● Nested sequences are unique, but result in skewed values. ● => use cycling sequences

Nested Sequences Cycling Sequences 33

Prime Sequences

● Problem ● Cycling sequences can hit a collision before the full cartesian product if constituent columns share a common factor...

● Solution ● Use sequences with prime cardinality.... 34

Prime Sequences

● Easy algorithm with no need for counters etc ● Just need the ranges for the columns and the row number to determine the key values

MOD(N - 1, R) + 1    N = Row Number    R = Range (Cardinality)

● Example: ● Col1 has a range of 2 values ● Col2 has a range of 3 values ● Col3 has a range of 5 values ● Full cartesian product would contain 30 rows 35

Prime Sequences

● Easy algorithm with no need for counters etc ● Just need the ranges for the columns and the row number to determine the key values
Col1: MOD(22 - 1, 2) + 1 = 2    22 = Row Number    2 = Range (Cardinality)

● Example: ● Col1 has a range of 2 values ● Col2 has a range of 3 values ● Col3 has a range of 5 values ● Full cartesian product would contain 30 rows 36
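
A hedged worked example in DB2 SQL: a recursive common table expression generates row numbers 1..30 and applies the MOD(N - 1, R) + 1 formula to the three illustrative ranges (2, 3 and 5). Because the ranges are pairwise prime, all 30 combinations of the Cartesian product appear before any key repeats.

WITH gen(n) AS (
   VALUES (1)
   UNION ALL
   SELECT n + 1 FROM gen WHERE n < 30
)
SELECT n,
       MOD(n - 1, 2) + 1 AS col1,
       MOD(n - 1, 3) + 1 AS col2,
       MOD(n - 1, 5) + 1 AS col3
FROM gen;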

Prime Sequences

Unique Primes 37

Scaleup Fact Table

● Generate a base set of data and then "Scale Up" the rest ● Transpose an existing piece of data into a new piece of data for the scaleup ● Facts and Dimensions ● Facts are range partitioned into 100 parts ● Populate part 0 for each and then scaleup to fill the remaining 99
[Diagram: PART 0 is scaled up to populate PART 1, PART 2, PART 3, …]
38

Scaleup

MOD((L + V – 1), R) + 1
[Diagram: the base range of rows is extracted and its key columns are transposed with the formula above to produce the next range] 39

Scaleup

MOD((L + V – 1), R) + 1
[Diagram: rows are extracted from the base range, their key columns transposed, and the result loaded into the next range] 40

Scaleup

MOD((L + V – 1), R) + 1
[Diagram: the key values produced by the transposition match those the original formula would have generated; extract and load can run range by range] 41

ETL

● Requirement ● Identify a high speed tool to scale-up the initial base data-set in parallel on each host in isolation. ● Avoid bottlenecks that could impede scalability e.g. network bandwidth. ● Ensure ETL scales out linearly ● Examined three main approaches ● Datastage ● Native DB2 Methods ● Optim High Performance Unload 42

ETL

● Datastage ● Offers sophisticated ETL capabilities ● Access to the DB2 partitioning algorithm. ● Slower than collocated HPU->PIPE->LOAD scaleup. ● Native DB2 methods ● LOAD FROM CURSOR ● LOAD is however serialized through the coordinator. ● INSERT-SELECT ● Collocated INSERT-SELECT on NLI tables. ● Faster than LOAD FROM CURSOR; slower than HPU->LOAD ● High Performance Unload ● HPU and DB2 LOAD both facilitate direct access to database containers ● Parallel feature "ON HOST" used, repartitioning TARGET KEYS. ● A 1 Petabyte population milestone in approximately 30 days on 63 hosts. 43

Scaleup Implementation

[Diagram: NO TRAFFIC BETWEEN SERVERS. DataServer1 runs logical nodes 1,2,3 and DataServer2 runs logical nodes 4,5,6. On each server, Db2hpu -i instance -f VLDBcontrolfile unloads data from the containers for the local nodes, updates the key columns, and passes the data through named pipes (pipe.001–pipe.003 on DataServer1, pipe.004–pipe.006 on DataServer2). A load from pipe .. partitioned db config mode outputdbnums(1,2,3) / outputdbnums(4,5,6) then loads the data back to the containers on the same server.] 44

Ingesting Data

● LOAD ● The fastest utility for ingesting. ● Table is not fully available. ● COPY YES loads can have impact on tape library. ● Specify the DATA BUFFER parameter. ● Pre-sorted data to improve performance especially for MDC. ● Import/Insert ● Slow into very large scale partitioned database. ● Buffered/Array inserts offers superior throughput. ● Alternatively, LOAD NONRECOVERABLE into staging table and INSERT-SELECT into target table. ● Adjust commit size to tune ingest performance / row locking. 45
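
A sketch of the staging pattern mentioned above (paths, table names and the buffer size are illustrative): a nonrecoverable LOAD from a named pipe into a staging table with an explicit data buffer, followed by an INSERT-SELECT into the target.

mkfifo /work/pipe.001
db2 "LOAD FROM /work/pipe.001 OF DEL
     REPLACE INTO stage.tb_sales_stage
     NONRECOVERABLE
     DATA BUFFER 8192"
db2 "INSERT INTO bi_schema.tb_sales_fact SELECT * FROM stage.tb_sales_stage"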

Scale-up

● Using HPU and load this is extremely fast

● On one server, getting speeds of: ● 4,043,422 rows/min

● Full PB would take: ● 7 years on one server ● 1 month on 63 servers ● Linear out-scaling

[Chart: TBs per Day versus Number of Physical Nodes (4 to 64)] 46

The Big Four

● 574 Tables in total ● > 90% of the total raw data is contained in 4 large fact tables ● The four big fact tables and their associated dimensions:
[Chart: Total Raw Data equivalent] 48

Agenda ● VLDB Mission ● What is a PetaByte? ● Building a PetaByte System ● Shared Nothing Architecture ● Database Design ● Data Generation & ETL ● Workload & Testing ● Performance Monitoring ● Expanding the System ● Useful Tips for VLDB 49

Conducting Workload

● The VLDB workload includes - ● Query workload ● ETL ● Buffered inserts via Datastage DB2 Connector – partition level ● Insert-Update-Delete ● Administration activities ● REORG – online, offline, indexes ● RUNSTATS ● DDL – alter/drop/create tablespace/table/index/view/mqt/procedure ● DDL - REFRESH MQT ● ATTACH/DETACH/SET INTEGRITY on range partition tables ● Backup and Recovery ● Workload Manager ● E.g. ALTER THRESHOLD WLMBP_WRITE_DML_ROWSREAD WHEN SQLROWSREAD > 1000000 50
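
The ALTER THRESHOLD example above appears truncated; a hedged completion (the threshold name comes from the slide, and the STOP EXECUTION action is one possible choice):

db2 "ALTER THRESHOLD WLMBP_WRITE_DML_ROWSREAD
     WHEN SQLROWSREAD > 1000000
     STOP EXECUTION"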

What else is being tested?

● Database Partitions ● 128, 240, 472 and 1000 partitions ● System expansion & Redistribution ● System expanded from 20 to 32 hosts, then from 32 to 63 hosts. ● Redistribution stability and performance ● High Availability ● Interrupt & ABTerm ● Optimizer Execution Plan stability ● As database scales. 50TB, 100TB, 250TB, 400TB, 750TB and 1PB ● Under different fix packs (e.g. FP1 versus FP3) ● Manageability and PD Tools. ● Integration with Optim Performance Manager 51

Agenda ● VLDB Mission ● What is a PetaByte? ● Building a PetaByte System ● Shared Nothing Architecture ● Database Design ● Data Generation & ETL ● Workload & Testing ● Performance Monitoring ● Expanding the System ● Useful Tips for VLDB 52

Performance Monitoring

● System performance ● CPU ● Vmstat, nmon ● System CPU should be <= 25% of User CPU ● I/O waits < 25% ● RunQueue more representative ● I/O ● Disk: Iostat ● Network (FCM): Netstat, entstat ● Memory ● Svmon, vmstat ● VLDB ● Scripts to automate above collection on 60 second interval ● Augments existing topas output in /etc/perf/daily on 5 min interval ● Facilitates retrospective diagnosis. 53

Performance Monitoring

● DB2 ● Monitoring table functions. ● Lower overhead than older snapshot based functions ● MON_GET_UNIT_OF_WORK – monitor long running queries ● MON_GET_CONNECTION – aggregated measures for connected applications. Useful for checking locks. ● On VLDB use the MON_GET_MEMORY_POOL function to track instance memory allocations (FCMBP, Bufferpool, Sortheap) ● Db2top ● Quick interactive view ● Obtain data for a single partition using db2top -P ● Optim Performance Manager (OPM) ● Sophisticated graphical web monitoring. ● Facilitates retrospective analysis ● Leverages monitoring table functions 54
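
A sketch using one of the monitoring functions named above: the ten heaviest readers aggregated across all members (-2 requests every database partition), following the documented MON_GET_CONNECTION(application_handle, member) signature.

SELECT application_handle,
       SUM(rows_read) AS rows_read,
       SUM(total_cpu_time) AS total_cpu_time
FROM TABLE(MON_GET_CONNECTION(CAST(NULL AS BIGINT), -2)) AS t
GROUP BY application_handle
ORDER BY rows_read DESC
FETCH FIRST 10 ROWS ONLY;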

Agenda ● VLDB Mission ● What is a PetaByte? ● Building a PetaByte System ● Shared Nothing Architecture ● Database Design ● Data Generation & ETL ● Workload & Testing ● Performance Monitoring ● Expanding the System ● Useful Tips for VLDB 55

Expanding the System

● Add additional nodes to provide additional capacity. ● Mixed generation systems possible ● Must support same OS Level. ● Plan ahead ● Begin planning when growth capacity has reached 60% and is projected to reach 80% within 12 months ● Use REDISTRIBUTE command ● REDISTRIBUTE PARTITION GROUP PDPG UNIFORM NOT ROLLFORWARD RECOVERABLE DATA BUFFER 300000 ● PRECHECK ONLY option available in 9.7 Fix Pack 5 ● Ensure enough space to rebuild indexes on largest table. ● INDEXING MODE DEFERRED ● Extensive testing on VLDB ● System expanded in phases. 56
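
A hedged sketch of the expansion flow (host and partition numbers are hypothetical; the REDISTRIBUTE options are the ones listed on this slide, shown in one possible ordering):

db2start DBPARTITIONNUM 504 ADD DBPARTITIONNUM HOSTNAME newhost1 PORT 0 WITHOUT TABLESPACES
db2 "ALTER DATABASE PARTITION GROUP pdpg ADD DBPARTITIONNUMS (504) WITHOUT TABLESPACES"
-- create the tablespace containers on the new partition, then:
db2 "REDISTRIBUTE DATABASE PARTITION GROUP pdpg
     NOT ROLLFORWARD RECOVERABLE
     UNIFORM
     DATA BUFFER 300000
     INDEXING MODE DEFERRED"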

Agenda ● VLDB Mission ● What is a PetaByte? ● Building a PetaByte System ● Shared Nothing Architecture ● Database Design ● Data Generation & ETL ● Workload & Testing ● Performance Monitoring ● Expanding the System ● Useful Tips for VLDB 57

VLDB Tips

● Configure AUTO-RUNSTATS *Tips* ● Ensures stats are current on all tables, including system catalog ● Create statistics profile. RUNSTATS SET PROFILE ● Include STATISTICS USE PROFILE with LOAD to prevent AUTO-RUNSTATS blocking LOAD ● Use sampling for very large tables ● Runstats on table scm.tab on key columns with distribution on key columns tablesample system(1) ● If data distribution is uneven, call RUNSTATS on the biggest partition. ● Do not configure AUTO-REORG ● Instead use MDC to prevent the requirement to reorganize large tables. ● Use multiple coordinators ● Spread client connections across partitions ● Prevents over committing memory on any one host e.g. sortheap 58
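
A sketch combining the two tips above (the table name scm.tab comes from the slide; the input file path is hypothetical): register a sampled statistics profile once, then have LOAD reuse it so automatic RUNSTATS does not collide with the load.

db2 "RUNSTATS ON TABLE scm.tab
     ON KEY COLUMNS WITH DISTRIBUTION ON KEY COLUMNS
     TABLESAMPLE SYSTEM (1)
     SET PROFILE"
db2 "LOAD FROM /work/feed.del OF DEL
     INSERT INTO scm.tab
     STATISTICS USE PROFILE
     NONRECOVERABLE"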

VLDB Tips ● Use ssh for instance remote shell *Tips* ● DB2SET DB2RSHCMD=/bin/ssh ● Particularly important when > 200 partitions as this is the rsh limit ● Use connection concentrator ● For large numbers of applications. ● Use MAX_CONNECTIONS > MAX_COORDAGENTS (fixed) ● Use explicit activation ● db2 activate db myDB ● Use split diagnostics directories ● Avoid contention on a single diagnostics log ● Use db2diag -global -merge to merge. ● Use tablespace backup and rebuild utility to restore ● Allows finer grained recovery ● Hot, Warm, Cold data – backup the current data most frequently ● Avoid disproportionately large tablespaces 59
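
A sketch of the instance-level settings above (paths are illustrative; DB2RSHCMD is the registry variable for the remote shell, and the $h$n tokens split the diagnostic directory per host and partition):

db2set DB2RSHCMD=/bin/ssh
db2 update dbm cfg using diagpath '"/db2/db2inst1/db2dump $h$n"'
db2 activate db myDB
db2diag -global -merge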

VLDB Tips

● Compression *Tips* ● Enable compression if system is I/O bound (IO Waits) ● Do not enable compression if the system is CPU bound. ● Estimate compression ratios using the administration function ADMIN_GET_TAB_COMPRESS_INFO_V97 ● For optimal compression on big tables use REORG TABLE ... RESETDICTIONARY. ● The dictionary will be based on a sample of the whole table rather than just the first 2MB used with automatic dictionary creation. ● Perform RUNSTATS after REORG operation. ● MQTs ● Use compression on MQTs too. ● Perform RUNSTATS on MQT after compressing ● For large replicated tables on VLDB use a partitioned MQT to distribute the table replication across all partitions.... 60
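
A sketch of the compression-estimate check using the administrative function named above (schema and table names are placeholders):

SELECT *
FROM TABLE(SYSPROC.ADMIN_GET_TAB_COMPRESS_INFO_V97('THESCHEMA', 'THETABLE', 'ESTIMATE')) AS t;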

Refresh Large Replicated MQT

● Base Table is 500MB ● Admin NIC can handle 250MB/s ● Each server can receive/write a max of 200MB/s ● Configuration below takes 4 mins

Sending 1GB in total @ 250MB/s

[Diagram: the Admin node holds the Base table and sends it at 125MB/s to each of Data1 and Data2, which hold the MQT copies] 61

Add More Data Nodes

● Refresh MQT ● Base Table is 500MB ● Admin NIC can handle 250MB/s ● Each server can receive/write a max of 200MB/s ● Configuration below takes 8 mins

Sending 2GB in total @ 250MB/s

[Diagram: the Admin node holds the Base table and sends it at 72.5MB/s to each of Data1, Data2, Data3 and Data4, which hold the MQT copies] 62

Introduce a Distributed MQT

● Base table now on all servers (DPF) ● Each server now sends 1/5th of the table to each of the other servers ● So 100MB x 4 servers each to be transmitted ● 400MB on each server to be received ● Will take 2 minutes Each server sending 400MB in total @ a potential 250MB/s

[Diagram: with the Base table distributed across Admin, Data1, Data2, Data3 and Data4, each server holds a slice of the Base table plus an MQT copy and sends to the others at up to 200MB/s] 63

VLDB Tips ● Avoid global monitoring snapshot *Tips* ● GET SNAPSHOT FOR... GLOBAL ● Deprecated functionality – may over commit memory ● Instead use monitoring functions. ● MON_GET_CONNECTION, MON_GET_TABLESPACE etc. ● Avoid over committing memory - paging ● Particularly important with High Availability. ● FCM channel and buffer allocation. ● Spread application connections ● Do not exceed the AIX ephemeral port range ● The number of ports allocated for FCM conduits is ● ( Number of Partitions × (Number of Partitions – 1) ) / Number of Hosts ● Avoid running many instances with a large number of partitions ● Avoid having too many tablespaces, too many table ranges.
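
As a hedged illustration using the numbers quoted earlier in this deck: with 504 partitions spread over 63 hosts, the formula gives (504 × 503) / 63 = 4,024 FCM ports per host, which must fit comfortably within the AIX ephemeral port range.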

Austin Clifford IBM [email protected] Session VLDB - An Analysis of DB2 at Very Large Scale


VLDB - An Analysis of DB2 at Very Large Scale

Austin Clifford IBM DRAFT Session Code: 2130 Fri, May 18, 2012 (08:00 AM - 09:00 AM) | Platform: DB2 for LUW - II

Abstract: The Very Large Database project is an exciting and unprecedented initiative to verify the performance and scalability of DB2 and its complementary products at very large scale. The trend towards real-time analytics is placing increasing demands on data warehouse systems. The investigations by the team in Dublin include simulating heavy on-peak analytics in parallel with other essential system functions such as data ingest and backup and recovery. In order to achieve a database of this magnitude, the team have developed and patented innovative techniques for rapid population of customer-like data. Valuable insights are being learned and these will feed into product design and best practice recommendations, to ensure that DB2 continues to outpace future customer needs. This presentation will take us through these insights and highlight the key considerations to ensure a successful large scale data warehouse solution.

1

2

Presentation Objectives

1) Design & implementation of a VLDB. 2) Benefits and best practice use of DB2 Warehouse features. 3) Ingesting data into VLDB. 4) Approach & considerations to scaling out VLDB as the system grows. 5) Management and problem diagnosis of a VLDB.

3

Disclaimer

●© Copyright IBM Corporation 2012. All rights reserved. ●U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

●THE INFORMATION CONTAINED IN THIS PRESENTATION IS PROVIDED FOR INFORMATIONAL PURPOSES ONLY. WHILE EFFORTS WERE MADE TO VERIFY THE COMPLETENESS AND ACCURACY OF THE INFORMATION CONTAINED IN THIS PRESENTATION, IT IS PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED. IN ADDITION, THIS INFORMATION IS BASED ON IBM’S CURRENT PRODUCT PLANS AND STRATEGY, WHICH ARE SUBJECT TO CHANGE BY IBM WITHOUT NOTICE. IBM SHALL NOT BE RESPONSIBLE FOR ANY DAMAGES ARISING OUT OF THE USE OF, OR OTHERWISE RELATED TO, THIS PRESENTATION OR ANY OTHER DOCUMENTATION. NOTHING CONTAINED IN THIS PRESENTATION IS INTENDED TO, NOR SHALL HAVE THE EFFECT OF, CREATING ANY WARRANTIES OR REPRESENTATIONS FROM IBM (OR ITS SUPPLIERS OR LICENSORS), OR ALTERING THE TERMS AND CONDITIONS OF ANY AGREEMENT OR LICENSE GOVERNING THE USE OF IBM PRODUCTS AND/OR SOFTWARE.

•IBM, the IBM logo, ibm.com, and DB2 are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml

4 4

What is a Very Large Database?

A very large database, or VLDB, is a database that contains an extremely high number of tuples (database rows), or occupies an extremely large physical filesystem storage space. The most common definition of VLDB is a database that occupies more than 1 terabyte.

Speaker Bio: Austin is a DB2 Data Warehouse QA Specialist in Dublin Information Management. Prior to joining IBM in 2009 Austin worked as a database consultant in the Banking sector and has 15 years industry experience in Data Modelling, Database Design, Database Administration and design of ETL applications. Austin holds degrees in Engineering and Management Science from University College Dublin.

Austin is the technical lead on the VLDB project since 2010. He works closely with DB2 Best Practices and is a Customer Lab Advocate.

4

5 5

Agenda ● VLDB Mission ● What is a PetaByte? ● Building a PetaByte System ● Shared Nothing Architecture ● Database Design ● Data Generation & ETL ● Workload & Testing ● Performance Monitoring ● Expanding the System ● Useful Tips for VLDB

This presentation will walk us through the Very Large Database project from inception to the present time.

We'll look at how we built a system of this unprecedented scale from literally the ground up. We'll look at the hardware building blocks, including the vast amount of storage to accommodate a database of this magnitude.

We'll do a brief review of the shared nothing paradigm and how this is implemented using DB2 Partitioning Feature.

Next up, we'll turn our attention to the design of the large scale database. We'll look at the warehouse best practice and where we intentionally departed from this, for example using disjoint partition groups.

Once we've got a sound foundation (database design), we need to generate and load the vast amount of data. We'll look at the algorithms we developed to synthesize intelligent data based on real-life data warehouse artifacts.

Then once we have a fully built and populated system we can drive the complex mixed workload which is representative of the demands on production data warehouses.

Then we'll look at how we monitor the performance of a VLDB and then how to plan and expand the system for future growth.

Finally, we'll look at the insights we gained and useful tips for administering databases of this scale.

5

6 6

VLDB Mission

● Real-time analytics is placing increasing demands on data warehouse systems. ● Verify the performance and scalability of DB2 and its complementary products at the Petabyte scale. ● Simulate heavy on-peak analytics in parallel with other essential system functions such as data ingest and backup and recovery. ● Guide best practices and future product direction. ● Develop techniques for massive scale rapid data generation.

The VLDB mission was established in 2009 by Enzo Cialini, STSM, Chief Architect in DB2 SVT.

Work commenced in Jan 2010 to build a new lab to house the system in the Dublin campus. The database creation and data generation commenced in June 2010.

6

7 7

Digital Data 101 – What is a Petabyte?

● 1 Bit = Binary Digit ● 8 Bits = 1 Byte ● 1024 Bytes = 1 Kilobyte ● 1024 Kilobytes = 1 Megabyte ● 1024 Megabytes = 1 Gigabyte ● 1024 Gigabytes = 1 Terabyte ● 1024 Terabytes = 1 Petabyte ● 1024 Petabytes = 1 Exabyte ● 1024 Exabytes = 1 Zettabyte ● 1024 Zettabytes = 1 Yottabyte ● 1024 Yottabytes = 1 Brontobyte ● 1024 Brontobytes = 1 Geopbyte

VLDB is over 1 PB compressed which is equivalent to several PBs of raw data.

It is currently the biggest DB2 LUW system worldwide.

7

8 8

Agenda ● VLDB Mission ● What is a PetaByte? ● Building a PetaByte System ● Shared Nothing Architecture ● Database Design ● Data Generation & ETL ● Workload & Testing ● Performance Monitoring ● Expanding the System ● Useful Tips for VLDB

We'll walk through the next 10 slides very quickly...

8

9 9

The Building Blocks We Start with the Storage:

1x = 450GB

1PB of DB Data = Raw Data + RAID + Contingency = 1.6PB

4,608 x = 1.6PB

4,608 spindles!!... RAID-5.... hot spares... during peak workload we can have several disk failures in a week, but the RAID configuration has provided 100% protection against data loss

9

10 10

The Building Blocks ● Disks get housed in EXP5000 enclosures ● EXP5000 can hold 16 disks

4608/16 = x 288

● EXP5000 need a DS5300 Storage controller to manage the IO activity (1 DS for 18 EXP)

x 288 = x 16

288 EXP5000 enclosures behind 16 DS5300 SAN controllers

10

11 11

The Building Blocks

●That's the storage done – now we need to drive the system with servers. ●To maximise the advantages of parallel processing, the 16 Storage controllers & disks are assigned to 1 cluster each, with a Smart Analytics guideline of 4 p550 Servers per cluster (64 servers total)

= 4 x

64 p550 servers in Smart Analytics 7600 configuration.

1 management, 1 admin, 62 data nodes.

Each configured with 64GB physical memory.

11

12 12

The Building Blocks ●The communication between devices takes place via Juniper Network switches for the copper networks and IBM SAN switches for the fiber networks

●The server control for the 64 servers is managed by the HMC (Hardware Management Console)

Dual bonded (2Gbps) FCM network. Separate network for Hardware Management.

12

13 13

Hardware Summary

● Full VLDB deployment: ● Smart Analytics like configuration ● 64 p550 Servers ● 16 DS5300 Storage Controllers ● 288 EXP5000 Disk Enclosures ● 4,608 Disks (450GB each -> 1.6PB) ● 8 IBM SAN switches (24p/40p) ● 7 Juniper Network switches (48p) ● 2 HMCs ● 6KM of copper cables ● 2KM of fiber cables ● Occupies 33 fully loaded racks ● Latest 'Free cooling' designs are incorporated into the lab ● Resulting in a predicted saving of 60% of the power required for cooling

13

14 14

Where is the system housed?

● The VLDB deployment when racked up, occupies 33 fully populated racks ● At project inception, there was no lab on the Dublin campus that could house the power and cooling requirements ● A brand new lab was built ● Each device and Rack for the VLDB system was delivered individually in its own packaging and had to be unpacked and racked ● Packaging should not be underestimated!! ● The VLDB project filled 7 industrial dumpsters with packaging.

14

15 15

Free Cooling

● There are 6 CRAC (Computer Room Air Con) units in the IM Lab ● Ireland's favourable (?) climate results in significant savings for Computer room cooling ● As long as the outside air temp is below 9.5 degrees C, 100% of the cooling of the room is by fresh air ● Over the full year, 80% of the cooling will be fresh air provisioned

15

16 16

Expansion Groups

The system was built in phases (we'll talk about expansion later in the presentation).

Each set of 2 racks constitutes an “expansion group” of which there are 16 in total. (1 additional rack for network switches).

Each expansion group contains 16 EXP5000 drawers of storage, 1 DS5300 SAN controller and 4 P550 servers.

The expansion groups are linked through the FCM interconnect only.

16

17 17

Software Stack

● The following software was installed on the system: ● DB2 (Server 9.7 Fix Pack 5) ● IBM AIX 6.1 TL6 SP5 ● IBM General Parallel File System (GPFS™ ) 3.3.0.14 ● IBM Tivoli System Automation for Multi-Platforms 3.1.1.3 ● IBM DS Storage Manager 10.60.G5.16.

Testing commenced on DB2 9.7 FP1 and we continued testing through the levels. We are now testing the next version of DB2.

17

18 18

Agenda ● VLDB Mission ● What is a PetaByte? ● Building a PetaByte System ● Shared Nothing Architecture ● Database Design ● Data Generation & ETL ● Workload & Testing ● Performance Monitoring ● Expanding the System ● Useful Tips for VLDB

Now that we've walked through the building of the VLDB infrastructure, we can turn our attention to the database layer.

First, let's do a quick refresh of the shared nothing paradigm...

18

19 19

Shared Nothing Architecture

select … from table

Tables

[Diagram: Fast Communication Manager connecting Engine + data+log on Partition 1, Partition 2, Partition 3, … Partition n]

Database ● Partitioned Database Model ● Database is divided into 504 partitions ● Partitions run on 63 physical nodes (8 partitions per host) ● Each Partition Server has dedicated resources ● Parallel Processing occurs on all partitions: coordinated by the DBMS ● Single system image to user and application

Shared nothing means exactly that – no shared disk, no shared memory or processors. Not unique to DB2 – also used in Teradata, Netezza, Datallegro etc.

19

20 20

Shared Nothing Architecture ● Hash Partitioning ● Provides the best parallelism and maximizes I/O capability ● VLDB management (recovery, maintenance, etc.) ● Large scans automatically run in parallel... ● All nodes work together ● Truly scalable performance ● As we have 504 partitions, then it should finish in 1/504th of the time ● And not just the queries, but the utilities too (backup/restore, load, index build etc)

Hash partitioning into buckets which are mapped to partitions using a 4K partition map (pre DB2 9.7) or a 32K partition map (DB2 9.7 onwards). The partitioning key is crucial and is discussed later in this presentation.

20

21 21

Mapping DB2 Partitions to Servers

[Diagram: Node 1 hosts part0 and part1, Node 2 hosts part2 and part3; the partitions communicate through the FCM]

# db2nodes.cfg
0 node1 0
1 node1 1
2 node2 0
3 node2 1

•DB2 instance configuration file sqllib/db2nodes.cfg
•All databases in the instance share this definition
•File in the DB2 instance directory
•Sqllib directory located on one node of the system
•GPFS/NFS mounted by all other nodes

Db2nodes.cfg in the instance home sqllib directory maps partitions to their host servers. On VLDB there are 8 DB2 partitions (aka logical nodes) per server. The port range for the FCM is reserved for the instance in /etc/services on each host. When the database activates, an FCM conduit is allocated between each pair of partitions of the instance. With a large number of partitions this can consume a large number of ports (from the ephemeral port range).

21

22 22

Agenda ● VLDB Mission ● What is a PetaByte? ● Building a PetaByte System ● Shared Nothing Architecture ● Database Design ● Data Generation & ETL ● Workload & Testing ● Performance Monitoring ● Expanding the System ● Useful Tips for VLDB

Now that we've covered the db2 instance configuration let's look at the approach to designing a Very Large Database...

22

23 23

Database Design

● Star and snowflake ● Sampled production database artifacts. ● Dimensional levels and hierarchies. ● Larger dimension tables are typically snow-flaked. ● No referential integrity – relationships inferred. ● Dimensions tables have surrogate PKs ● Fact tables - composite PK or non-unique PK. ● Dimension FKs are indexed. ● All tables are compressed.

Combination of star and snowflake.
Based on artifacts sampled from real-life data warehouses.
Dimensional levels and hierarchies.
Larger dimension tables are typically snow-flaked.
No referential integrity – relationships inferred from indexing and joins (deviation from best practice).
Most dimension tables have surrogate primary keys (best practice).
Fact tables have a mixture of composite PK and non-unique PK (best practice).
Dimension foreign keys are indexed.
All tables are compressed. Reduces the storage requirement for the de-normalized schema design. Overall compression ratio of around 3.

23

24 24

Database Design

● Star schema for 4 largest fact tables

24

25 25

Database Design

● Partition Groups ● Small dimension tables in SDPG. ● Fact and large dimension tables are partitioned. ● Collocation of Facts and largest/frequently joined dimension. ● Disjoint partition groups to drive table queueing.

Partition Groups Small dimension tables (< 1 m rows) are placed on a single database partition (SDPG). Fact and large dimension tables are partitioned in partition groups containing 503 partitions across 63 hosts. Facts and largest/most frequently joined dimensions are collocated. Also, disjoint partition groups to drive FCM traffic (table queueing).

25

26 26

Database Design

● Partitioning key ● A subset of the primary key ● DISTRIBUTE BY HASH ● Fewer columns is better ● Surrogate key with high cardinality is ideal ● Collocation ● Possible for tables with same partitioning key ● Data type must match ● Collocate Fact with largest commonly joined dimension table ● Use table replication for other non-collocated dimensions. ● Trade-off between partition balancing and optimal collocation ● Skew ● Aim for skew of less than 10% ● Avoid straggler partition.

Partitioning key A subset of the primary key DISTRIBUTE BY HASH Fewer columns is better Surrogate key with high cardinality is ideal candidate Collocation Collocation possible for tables with same partitioning key Data type must match Collocate Fact with largest commonly joined dimension table Use table replication for other non-collocated dimensions. Trade-off between partition balancing and optimal collocation Skew Aim for skew of less than 10% Deviation from even skew should be lower rather than larger than average rowcount to avoid outlier

26

27 27

Check Skew

-- rows per partition
SELECT dbpartitionnum(date_id) AS "Partition number",
       count(1)*10 AS "Total # records"
FROM bi_schema.tb_sales_fact TABLESAMPLE SYSTEM (10)
GROUP BY dbpartitionnum(date_id)

Partition number    Total # records
1                   10,313,750
2                   10,126,900
3                    9,984,910
4                   10,215,840

-- Space allocation per partition
SELECT dbpartitionnum, SUM(data_object_l_size) AS size_kb
FROM sysibmadm.admintabinfo
WHERE tabschema = 'THESCHEMA' AND tabname = 'THETABLE'
GROUP BY ROLLUP( dbpartitionnum )
ORDER BY 2;

27

28 28

Database Design

● Separate tablespaces for: ● Staging Tables ● Indexes ● MQTs ● Table data ● Individual data partitions in large range partition tables ● Page Size ● On VLDB, tablespaces with all pagesize included (4K,8K,16K,32K). ● Typically larger tables have larger pagesize. ● Range Partitioning ● Most Fact tables and large dimension tables are RP ● Range partitioned by date interval. ● Less that 100 ranges ideal. ● Partitioned (local) indexes.

●Range Partitioning ● Most Fact tables and large dimension tables are RP ● Range partitioned by date interval. ● VLDB employs less that 100 range partitions except for 4 very big fact tables (> Trillion rows) which have several hundred ranges.

28

29 29

Database Design

● Multi-dimensional Clustering ● MDC with various number of cells ● Performance, less REORG. ● “Coarsify” dimensions. ● Monotonic functions. ● MDC and RP combination ● Careful with the resulting number of cells... ● Materialized Query Tables ● Pre-compute costly aggregations and joins. ● REFRESH DEFERRED. ● Replicated tables for non-collocated dimension. ● Layering of MQT.

Multi-dimensional Clustering MDC with varying number of cells: 100, 1000, 10000, 1000000 cells. Performance, reduced requirement to REORG. Aggregate functions to “coarsify” dimensions. Monotonic functions. MDC and RP combination Careful with the resulting number of cells... Materialized Query Tables MQTs incorporated to pre-compute costly aggregations and joins. REFRESH DEFERRED with and without staging tables. Replicated tables build on non-collocated dimension tables (technique later for refresh of large base table). Layering of MQT used for ROLAP drill-down dimensions.

29

30 30

Agenda ● VLDB Mission ● What is a PetaByte? ● Building a PetaByte System ● Shared Nothing Architecture ● Database Design ● Data Generation & ETL ● Workload & Testing ● Performance Monitoring ● Expanding the System ● Useful Tips for VLDB

Now that we've designed and created the database, let's take a look at how we generated and loaded (via ETL), the vast amount of data required to achieve a Petabyte milestone...

30

31 31

Intelligent Data Generation

● Workloads and schema. ● 574 Tables ● 7,500 complex SQL statements ● Representative of a cross Data Generator section of real production data warehouses

● Synthetic data ● Referential integrity determined from SQL joins ● Valid result sets for the queries ● Data generated using prime sequences to prevent primary key collisions (patent pending)

We sampled 7,500 complex Select statements from real production warehouses, representing a vast array of query constructs and resulting execution plan operators. We looked at existing toolsets for generating intelligent data to satisfy the select statements (i.e. to return a valid result set) and there was no tool that would both satisfy the statements and also allow rapid data generation of trillions of rows in a matter of weeks rather than months or years. We developed new algorithms for generating and subsequently scaling up (via ETL) intelligent data, while maintaining referential integrity (preventing orphaned fact rows). A patent was filed in 2010 for these algorithms.

31

32 32

Prime Sequences

● Prevent key collisions ● Duplicates are very costly during load. ● Avoiding PK collisions essential. ● Nested sequences are unique, but results in skewed values. ● => use cycling sequences

Nested Sequences Cycling Sequences

Prevent key collisions. Eliminating duplicates during the index build phase is very costly compared to bulk load. It can be an order of magnitude slower to delete a small percentage of rows compared to loading everything direct to container... The problem is particularly acute at the VLDB scale when we're loading tens of billions of rows into a single table range. => Avoiding PK collisions is essential for high speed data population. Nested sequences do guarantee uniqueness, but result in a skewed distribution of values and very sparse fact tables. => use cycling sequences.

32

33 33

Prime Sequences

● Problem. ● Cycling sequences can hit collision before full cartesian product if constituent columns share common factor.....

● Solution ● Use sequences with prime cardinality....

Collisions are much more likely to occur when generating huge datasets. These are caused by cycling sequences sharing a common factor. We need a simple, efficient way to prevent these which use simple arithmetic operators (for performance) and does not require counters etc. to track previously used combinations...

33

34 34

Prime Sequences

● Easy algorithm with no need for counters etc ● Just need the ranges for the columns and the row number to determine the key values

MOD(N - 1, R) + 1    N = Row Number    R = Range (Cardinality)

● Example: ● Col1 has a range of 2 values ● Col2 has a range of 3 values ● Col3 has a range of 5 values ● Full cartesian product would contain 30 rows

So, what if we use prime sequences?

Prime numbers by their very definition do not share a common factor and therefore this guarantees that the cartesian product of the sequences can be reached without encountering a collision.

Furthermore, this simple formula is all we need to calculate the value of the key column for a given row. Also, this formula lends itself to partitioning the generation i.e. a range of rows can be generated independently of another range. This is important, as parallelism is essential to obtaining the throughput required to generate billions of rows quickly....

34

35 35

Prime Sequences

● Easy algorithm with no need for counters etc ● Just need the ranges for the columns and the row number to determine the key values
Col1: MOD(22 - 1, 2) + 1 = 2    22 = Row Number    2 = Range (Cardinality)

● Example: ● Col1 has a range of 2 values ● Col2 has a range of 3 values ● Col3 has a range of 5 values ● Full cartesian product would contain 30 rows

And the same calculation for the 2nd key column...

35

36 36

Prime Sequences

Unique Primes

Next, to make sure that the joins (reverse-engineered from the SELECT statements) work, we need to propagate the same prime cardinality to all related columns.

We also need to ensure that the prime cardinality is unique among all tables that it is propagated to... This check is performed using a recursive SELECT using a common table expression (WITH clause).

36

37 37

Scaleup Fact Table

● Generate a base set of data and then "Scale Up" the rest ● Transpose an existing piece of data into a new piece of data for the scaleup ● Facts and Dimensions ● Facts are range partitioned into 100 parts ● Populate part 0 for each and then scaleup to fill the remaining 99
[Diagram: PART 0 is scaled up to populate PART 1, PART 2, PART 3, …]

Even when using the prime cardinality algorithm and partitioning this across multiple parallel (java) threads, the throughput is still not enough. The throughput is still governed by the time taken to generate the non-key columns which are (seeded) randomly generated number/strings etc. depending on data type. Therefore, rather than generating all non-key values from first principles we scale-up.... Scaleup as described reduces the cpu intensive random number generation and more closely approaches pure DISK I/O speed. i.e. much faster.

37

38 38

Scaleup

MOD((L + V – 1), R) + 1
[Diagram: the base range of rows is extracted and its key columns are transposed with the formula above to produce the next range]

The scaleup algorithm to transpose the sequential keys into the subsequent range is a close variation to that used to generate the sequential keys.

38

39 39

Scaleup

MOD((L + V – 1), R) + 1
[Diagram: rows are extracted from the base range, their key columns transposed, and the result loaded into the next range]

We'll look at the exact implementation of the Extract and Load in the coming slides....

39

40 40

Scaleup

MOD((L + V – 1), R) + 1
[Diagram: the key values produced by the transposition match those the original formula would have generated; extract and load can run range by range]

As you can see the key values calculated using the formula are exactly the same as the values which would have been generated by the original formula..

Importantly, the scaleup algorithm can also be partitioned...

40

41 41

ETL

● Requirement ● Identify a high speed tool to scale-up the initial base data-set in parallel on each host in isolation. ● Avoid bottlenecks that could impede scalability e.g. network bandwidth. ● Ensure ETL scales out linearly ● Examined three main approaches ● Datastage ● Native DB2 Methods ● Optim High Performance Unload

So, now we have our algorithm for rapid scaleup of data (while preventing collisions and maintaining data integrity).... the next question is how to implement this?.. It boils down to a choice of three....

Ensure ETL scales-out linearly. What we require here is that each host scales-up a table at a constant speed so that ingest rate per host remains constant as the number of hosts increases.

41

42 42

ETL

● Datastage ● Offers sophisticated ETL capabilities ● Access to DB2 partitioning algorithm. ● Slower than collocated HPU->PIPE->LOAD scaleup. ● Native DB2 methods ● LOAD FROM CURSOR ● LOAD is however serialized through the coordinator. ● INSERT-SELECT ● collocated INSERT-SELECT on NLI tables. ● Faster than LOAD FROM CURSOR; slower than the HPU->LOAD ● High Performance Unload ● HPU and DB2 LOAD both facilitate direct access to database containers ● Parallel feature “ON HOST” used, repartitioning TARGET KEYS. ● A 1 Petabyte population milestone in approximately 30 days on 63 hosts.

Datastage: Datastage offers sophisticated job control, meta-data, lineage, restart and dimensional capabilities. Datastage can scale-up data much faster than LOAD FROM CURSOR as it has access to the DB2 partitioning algorithm and can spread the repartitioning across hosts. Datastage is however slower than collocated HPU->PIPE->LOAD scaleup. Native DB2 methods: LOAD FROM CURSOR – LOAD is however serialized through the coordinator => bottleneck of the NIC on the coordinator. INSERT-SELECT – collocated INSERT-SELECT on tables altered with NOT LOGGED INITIALLY; faster than LOAD FROM CURSOR, slower than HPU->LOAD. High Performance Unload: HPU and DB2 LOAD both facilitate direct path access to database containers. Parallel feature "ON HOST" used, with the repartitioning TARGET KEYS option during LOAD, which facilitates parallel scale-up through multiple coordinator partitions. A 1 Petabyte population milestone is achievable in approx 30 days on 63 hosts.

42

43 43

Scaleup Implementation

[Diagram: NO TRAFFIC BETWEEN SERVERS. DataServer1 runs logical nodes 1,2,3 and DataServer2 runs logical nodes 4,5,6. On each server, Db2hpu -i instance -f VLDBcontrolfile unloads data from the containers for the local nodes, updates the key columns, and passes the data through named pipes (pipe.001–pipe.003 on DataServer1, pipe.004–pipe.006 on DataServer2). A load from pipe .. partitioned db config mode outputdbnums(1,2,3) / outputdbnums(4,5,6) then loads the data back to the containers on the same server.]

This diagram depicts the selected implementation of the scaleup ETL process using High Performance Unload to Extract the data which is then passed via a named/FIFO pipe directly to db2 Load utility for each db2 partition in parallel. By further restricting the scaleup algorithm to not change the distribution key and thus collocating the ETL on each server, this results in the scaleup process being extremely rapid.

43

44 44

Ingesting Data

● LOAD ● The fastest utility for ingesting. ● Table is not fully available. ● COPY YES loads can have impact on tape library. ● Specify the DATA BUFFER parameter. ● Pre-sorted data to improve performance especially for MDC. ● Import/Insert ● Slow into very large scale partitioned database. ● Buffered/Array inserts offers superior throughput. ● Alternatively, LOAD NONRECOVERABLE into staging table and INSERT-SELECT into target table. ● Adjust commit size to tune ingest performance / row locking.

LOAD The fastest utility for loading data. Table is not fully available. Read only is possible. Recoverable (COPY YES) loads can have serious impact on tape library (TSM etc.). Specify the DATA BUFFER parameter as default is often too small. Pre-sorted data will improve load throughput especially for MDC. Import/Insert Slow into very large scale partitioned database. Buffered/Array inserts offers superior throughput. Alternatively, LOAD NONRECOVERABLE into staging table and INSERT-SELECT with collocated inserts. Adjust commit size to tune ingest performance / row locking.

44

45 45

Scale-up

● Using HPU and load this is extremely fast

● On one server, getting speeds of: ● 4,043,422 rows/min

● Full PB would take: ● 7 years on one server ● 1 month on 63 servers ● Linear out-scaling

[Chart: TBs per Day versus Number of Physical Nodes (4 to 64)]

The chart may look contrived being so linear, but this is in reality exactly what we observe in terms of scale-up performance due to the collocated ETLs.

45

46 46

The Big Four

● 574 Tables in total ● > 90% of the total raw data is contained in 4 large fact tables ● The four big fact tables and their associated dimensions:

46

47

Total Raw Data equivalent

The compression ratio here does not include indexes and therefore the ratio is understated.

48 48

Agenda ● VLDB Mission ● What is a PetaByte? ● Building a PetaByte System ● Shared Nothing Architecture ● Database Design ● Data Generation & ETL ● Workload & Testing ● Performance Monitoring ● Expanding the System ● Useful Tips for VLDB

Now that we've got the database populated we can move on to running the workload and testing the system...

48

49 49

Conducting Workload

● The VLDB workload includes - ● Query workload ● ETL ● Buffered inserts via Datastage DB2 Connector – partition level ● Insert-Update-Delete ● Administration activities ● REORG – online, offline, indexes ● RUNSTATS ● DDL – alter/drop/create tablespace/table/index/view/mqt/procedure ● DDL - REFRESH MQT ● ATTACH/DETACH/SET INTEGRITY on range partition tables ● Backup and Recovery ● Workload Manager ● E.g. ALTER THRESHOLD WLMBP_WRITE_DML_ROWSREAD WHEN SQLROWSREAD > 1000000

Essentially the workload consists of the full data life-cycle for a typical Very Large Data Warehouse.

49

50 50

What else is being tested?

● Database Partitions ● 128, 240, 472 and 1000 partitions ● System expansion & Redistribution ● System expanded from 20 to 32 hosts, then from 32 to 63 hosts. ● Redistribution stability and performance ● High Availability ● Interrupt & ABTerm ● Optimizer Execution Plan stability ● As database scales. 50TB, 100TB, 250TB, 400TB, 750TB and 1PB ● Under different fix packs (e.g. FP1 versus FP3) ● Manageability and PD Tools. ● Integration with Optim Performance Manager

Realistic scenarios start with DB2 Best Practices and are then widened to examine PMRs/APARs.

50

51 51

Agenda ● VLDB Mission ● What is a PetaByte? ● Building a PetaByte System ● Shared Nothing Architecture ● Database Design ● Data Generation & ETL ● Workload & Testing ● Performance Monitoring ● Expanding the System ● Useful Tips for VLDB

Next, let's talk about performance monitoring. Monitoring the performance of a VLDB is challenging but vital....

51

52 52

Performance Monitoring

● System performance ● CPU ● Vmstat, nmon ● System CPU should be <= 25% of User CPU ● I/O waits < 25% ● RunQueue more representative ● I/O ● Disk: Iostat ● Network (FCM): Netstat, entstat ● Memory ● Svmon, vmstat ● VLDB ● Scripts to automate above collection on 60 second interval ● Augments existing topas output in /etc/perf/daily on 5 min interval ● Facilitates retrospective diagnosis.

Automating the collection of performance metrics is important to understand the system behavior under normal, peak and offline workload. Establish a baseline and retain these metrics for planning system expansion and performance tuning.

RunQueue.... In a cluster system like the IBM Smart Analytics System, with multiple servers with multi-core and multithreaded CPUs, we have to check first at what level the monitoring tool is calculating 100%: it might be the maximum utilization of all threads but can be scaled down to thread level too. In the latter case a fully utilized 16 threaded system would show a utilization of 1600%. In the case of a CPU related bottleneck it might be possible that we see a CPU utilization of around 7% only. But looking into the details – which would be the thread or core level – will show us that one thread is utilized at 100% and the other 15 threads of the 16 thread system are staying idle waiting for tasks. To get an impression regarding the load on a cluster, the length of the runqueue is a good measure: it gives the administrator a good hint on how busy in terms of parallel running jobs the system is.

52


Performance Monitoring

● DB2 ● Monitoring table functions ● Lower overhead than the older snapshot-based functions ● MON_GET_UNIT_OF_WORK – monitor long-running queries ● MON_GET_CONNECTION – aggregated measures for connected applications. Useful for checking locks. ● On VLDB use the MON_GET_MEMORY_POOL function to track instance memory allocations (FCMBP, buffer pool, sort heap) ● db2top ● Quick interactive view ● Obtain data for a single partition using db2top -P ● Optim Performance Manager (OPM) ● Sophisticated graphical web monitoring ● Facilitates retrospective analysis ● Leverages the monitoring table functions

From V9.7 onwards, use the monitoring table functions; they have lower overhead than the older snapshot-based functions. MON_GET_UNIT_OF_WORK monitors long-running queries. MON_GET_CONNECTION provides aggregated performance measures for connected applications and is useful for checking locks. Use -2 for the second parameter to obtain metrics for all partitions. On VLDB use the MON_GET_MEMORY_POOL function to track instance memory allocations (FCMBP, buffer pool, sort heap). db2top is useful for a quick interactive view of system performance; the monitoring functions are more useful for capturing periodic snapshots. Obtain data for a single partition using db2top -P. Optim Performance Manager (OPM) provides a sophisticated graphical, web-based monitoring capability, captures metrics on a periodic basis for detailed retrospective analysis, and leverages the monitoring table functions. A sketch of the table-function calls follows.
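A minimal sketch of how these table functions might be invoked; the column names are taken from the 9.7 monitoring interfaces and should be verified against your level, and -2 requests data from all partitions:

  -- longest-running units of work across all partitions
  SELECT member, application_handle, uow_id, total_rqst_time, rows_read
    FROM TABLE(MON_GET_UNIT_OF_WORK(NULL, -2)) AS t
    ORDER BY total_rqst_time DESC
    FETCH FIRST 10 ROWS ONLY;

  -- per-partition memory pool usage (FCMBP, buffer pool, sort heap, ...)
  SELECT member, memory_pool_type, memory_pool_used
    FROM TABLE(MON_GET_MEMORY_POOL(NULL, CURRENT SERVER, -2)) AS t
    ORDER BY memory_pool_used DESC;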

53


Agenda ● VLDB Mission ● What is a PetaByte? ● Building a PetaByte System ● Shared Nothing Architecture ● Database Design ● Data Generation & ETL ● Workload & Testing ● Performance Monitoring ● Expanding the System ● Useful Tips for VLDB

Sooner or later a successful data warehouse will require additional capacity. Indeed, the VLDB system was expanded in phases...

54


Expanding the System

● Add additional nodes to provide additional capacity ● Mixed generation systems possible ● Must support the same OS level ● Plan ahead ● Begin planning when growth capacity has reached 60% and is projected to reach 80% within 12 months ● Use the REDISTRIBUTE command (see the sketch below) ● REDISTRIBUTE DATABASE PARTITION GROUP PDPG NOT ROLLFORWARD RECOVERABLE UNIFORM DATA BUFFER 300000 ● PRECHECK ONLY option available in 9.7 Fix Pack 5 ● Ensure enough space to rebuild indexes on the largest table ● INDEXING MODE DEFERRED ● Extensive testing on VLDB ● System expanded in phases.
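A minimal sketch of the redistribution step, using the partition-group name and DATA BUFFER size from the slide; the exact spelling and ordering of the PRECHECK option should be confirmed against the Command Reference for your fix pack:

  -- 9.7 Fix Pack 5 onwards: validate space and configuration without moving data
  REDISTRIBUTE DATABASE PARTITION GROUP PDPG
    NOT ROLLFORWARD RECOVERABLE UNIFORM
    PRECHECK ONLY;

  -- the actual redistribution; defer index rebuild to save space on the largest tables
  REDISTRIBUTE DATABASE PARTITION GROUP PDPG
    NOT ROLLFORWARD RECOVERABLE UNIFORM
    DATA BUFFER 300000
    INDEXING MODE DEFERRED;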

Good capacity planning practices can assist in the early detection of trends in resource usage. You should create and document a performance baseline for each workload and a forecast baseline for the next 12 months. Add a data module to expand storage capacity and reduce the data volume per database partition, or add a user module to increase the capacity of the system to accommodate more users. Review your backup and recovery infrastructure to ensure that you can maintain service level objectives after the expansion. ETL applications and maintenance scripts will need to be reviewed to accommodate the expanded system, for example the DataStage DB2 connector.

Mixed-generation modules must run the same OS level, e.g. adding 7600 R2 or 7700 modules to a Smart Analytics 7600 R1 cluster; the additional partitions are created on the next-generation module.

55


Agenda ● VLDB Mission ● What is a PetaByte? ● Building a PetaByte System ● Shared Nothing Architecture ● Database Design ● Data Generation & ETL ● Workload & Testing ● Performance Monitoring ● Expanding the System ● Useful Tips for VLDB

56


VLDB Tips

● Configure AUTO-RUNSTATS *Tips* ● Ensures stats are current on all tables, including the system catalog ● Create a statistics profile: RUNSTATS ... SET PROFILE (see the sketch below) ● Include STATISTICS USE PROFILE with LOAD to prevent AUTO-RUNSTATS blocking the LOAD ● Use sampling for very large tables ● RUNSTATS ON TABLE scm.tab ON KEY COLUMNS WITH DISTRIBUTION ON KEY COLUMNS TABLESAMPLE SYSTEM(1) ● If data distribution is uneven, run RUNSTATS on the biggest partition ● Do not configure AUTO-REORG ● Instead use MDC to avoid the need to reorganize large tables ● Use multiple coordinators ● Spread client connections across partitions ● Prevents overcommitting memory (e.g. sort heap) on any one host
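A minimal sketch combining these tips, using the scm.tab name from the slide (the input file name is hypothetical):

  -- sampled statistics on key columns, registered as the table's statistics profile
  RUNSTATS ON TABLE scm.tab
    ON KEY COLUMNS
    WITH DISTRIBUTION ON KEY COLUMNS
    TABLESAMPLE SYSTEM(1)
    SET PROFILE;

  -- LOAD gathers statistics using that profile, so AUTO-RUNSTATS does not need to follow behind it
  LOAD FROM /stage/tab.del OF DEL
    INSERT INTO scm.tab
    STATISTICS USE PROFILE;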

57


VLDB Tips ● Use ssh for the instance remote shell *Tips* ● db2set DB2RSHCMD=/bin/ssh ● Particularly important when > 200 partitions, as this is the rsh limit ● Use the connection concentrator ● For large numbers of applications ● Set MAX_CONNECTIONS > MAX_COORDAGENTS (fixed) ● Use explicit activation ● db2 activate db myDB ● Use split diagnostics directories ● Avoid contention on a single diagnostics log ● Use db2diag -global -merge to merge ● Use tablespace backups and the rebuild utility to restore ● Allows finer-grained backups ● Hot, warm, cold data – back up the current data most frequently ● Avoid disproportionately large tablespaces
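The corresponding commands, sketched; the registry variable and db2diag options are as named on the slide, while the MAX_CONNECTIONS/MAX_COORDAGENTS values shown are purely illustrative:

  db2set DB2RSHCMD=/bin/ssh                # use ssh rather than rsh for the instance remote shell
  db2 update dbm cfg using MAX_CONNECTIONS 5000 MAX_COORDAGENTS 500    # concentrator is enabled when MAX_CONNECTIONS > MAX_COORDAGENTS
  db2 activate db myDB                     # explicit activation
  db2diag -global -merge                   # merge the split diagnostic logs from all hosts and partitions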

58


VLDB Tips

● Compression *Tips* ● Enable compression if the system is I/O bound (I/O waits) ● Do not enable compression if the system is CPU bound ● Estimate compression ratios using the administration function ADMIN_GET_TAB_COMPRESS_INFO_V97 (see the sketch below) ● For optimal compression on big tables use REORG TABLE ... RESETDICTIONARY ● The dictionary will be based on a sample of the whole table rather than just the first ~2MB used with automatic dictionary creation ● Perform RUNSTATS after the REORG operation ● MQTs ● Use compression on MQTs too ● Perform RUNSTATS on the MQT after compressing ● For large replicated tables on VLDB use a partitioned MQT to distribute the table replication across all partitions....
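A minimal sketch against a hypothetical SCM.FACT_SALES table; 'ESTIMATE' mode projects the savings from a sample, and the column names should be verified for your level:

  -- projected savings before enabling compression
  SELECT dbpartitionnum, pages_saved_percent, bytes_saved_percent
    FROM TABLE(SYSPROC.ADMIN_GET_TAB_COMPRESS_INFO_V97('SCM', 'FACT_SALES', 'ESTIMATE')) AS t;

  -- enable compression, build the dictionary from a table-wide sample, then refresh statistics
  ALTER TABLE scm.fact_sales COMPRESS YES;
  REORG TABLE scm.fact_sales RESETDICTIONARY;
  RUNSTATS ON TABLE scm.fact_sales WITH DISTRIBUTION AND DETAILED INDEXES ALL;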

59


Refresh Large Replicated MQT

● Base Table is 500MB ● Admin NIC can handle 250MB/s ● Each server can receive/write a max of 200MB/s ● Configuration below takes 4 mins

[Diagram: the Admin node holds the base table and sends a full copy to each of the two data nodes (Data1, Data2) holding the MQT, at 125MB/s per stream – 1GB in total through the 250MB/s admin NIC.]

60


Add More Data Nodes

● Refresh MQT ● Base Table is 500MB ● Admin NIC can handle 250MB/s ● Each server can receive/write a max of 200MB/s ● Configuration below takes 8 mins

[Diagram: the Admin node now sends a full copy of the base table to each of four data nodes (Data1–Data4) holding the MQT – 2GB in total through the same 250MB/s admin NIC, i.e. roughly 62.5MB/s per stream when shared four ways.]

This is a likely scenario in very large data warehouses. As the "smaller" non-collocated dimension tables grow and the number of database partitions increases, more and more FCM traffic is forced through the coordinator's FCM adapter. By using the technique described here of distributing the base table and replicating it via a partitioned MQT, this bottleneck can be avoided.

61


Introduce a Distributed MQT

● Base table now distributed across all servers (DPF) ● Each server now sends 1/5th of the table to each of the other servers ● So 100MB to be transmitted to each of the 4 other servers ● 400MB on each server to be received ● Will take 2 minutes ● Each server sends 400MB in total @ a potential 250MB/s

[Diagram: the base table and the MQT now reside on all five servers (Admin, Data1–Data4); each server exchanges its slice with the others at up to 200MB/s.]
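The deck does not show the DDL behind this layout, so the following is only a sketch of one possible shape, with hypothetical table, column and tablespace names: distribute the base table across the data partitions, then define the replicated MQT over it so that each partition sources the refresh from its own slice.

  -- base table hash-distributed across all data partitions (hypothetical names)
  CREATE TABLE mart.dim_store (
    store_id   INTEGER NOT NULL,
    store_name VARCHAR(100)
  ) IN ts_pd_data
    DISTRIBUTE BY HASH (store_id);

  -- replicated MQT: a full copy maintained on every partition,
  -- refreshed in parallel from the distributed base table
  CREATE TABLE mart.dim_store_rep AS
    (SELECT store_id, store_name FROM mart.dim_store)
    DATA INITIALLY DEFERRED REFRESH DEFERRED
    IN ts_pd_data REPLICATED;

  REFRESH TABLE mart.dim_store_rep;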

62


VLDB Tips ● Avoid global monitoring snapshots *Tips* ● GET SNAPSHOT FOR... GLOBAL ● Deprecated functionality – may overcommit memory ● Instead use the monitoring functions ● MON_GET_CONNECTION, MON_GET_TABLESPACE etc. ● Avoid overcommitting memory – paging ● Particularly important with High Availability ● FCM channel and buffer allocation ● Spread application connections ● Do not exceed the AIX ephemeral port range ● The number of ports allocated for FCM conduits is (Number of Partitions x (Number of Partitions – 1)) / Number of Hosts ● Avoid running many instances with a large number of partitions ● Avoid having too many tablespaces or too many table ranges.

● Avoid overcommitting memory – FCM channel and buffer allocation is proportional to the total number of partitions in the instance.
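As a worked example of that formula, using the partition and host counts quoted earlier in this deck (1000 partitions across 63 hosts), and assuming the default AIX ephemeral range of 32768–65535 (roughly 32,000 ports):

  Ports per host = (Partitions x (Partitions - 1)) / Hosts
                 = (1000 x 999) / 63
                 ≈ 15,857

This fits inside the default ephemeral range, but because the port count grows with the square of the partition count, the headroom shrinks quickly as the system expands.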

63


Austin Clifford, IBM – [email protected]
Session: VLDB - An Analysis of DB2 at Very Large Scale

64