Session: A12

Space: The Storage Frontier for Distributed DB2 DBA’s

Jerry Spence National City Corporation

May 10, 2007 9:20 a.m. – 10:20 a.m.

Platform: DB2 for LUW

All databases reside on storage. To most people, storage is nothing but boxes of hardware, but the DBA must go beyond the basics and understand how storage is laid out, what the components of the storage design are, and how they relate to the database. This presentation covers storage concepts, how they relate to the database, and some considerations for deploying a database on various storage devices.

1 Objectives:

• The hardware side of storage: arrays, controllers, HBAs, RAID, SAN, disks

• The AIX view of storage: the Logical Volume Manager

• The database view of storage: tablespaces, automatic storage, tables, indexes, pages, extents, prefetching

• DPF and the storage design: using the BCU model

• Backup and recovery and storage strategies


1. Overview: Why you need to understand the storage system, and why it isn't just a black box you can ignore.
2. The hardware components of storage: Discusses server, network, and storage array components and how they relate.
3. The OS view of storage: Discusses the AIX Logical Volume Manager and how file systems are created.
4. OK, I have a database to create, what's next? Discusses how the storage design impacts the database design, plus some considerations.
5. Multi-partition and shared nothing: Multiple partitions present challenges in the storage design. Discusses some considerations when building a multi-partition database.
6. Backup and recovery and the storage design: How you define your backup and recovery procedure has a direct impact on storage usage. Discusses the various strategies and how they relate to disk usage.

2 Overview

Storage is an integral part of databases. Databases could not exist without some media for saving data. Storage comes in many designs and configurations, and how it is used can have a direct impact on the performance of a database. This presentation gives a basic review of storage and how it relates to DB2 databases.


3 Storage


4 Hardware: The Basics

• Hard drives are composed of many parts, but the key parts are the platters, the spindle and motor, the read/write heads, and the head actuator.
• The platters are circular disks typically made of a light aluminum alloy, glass, or ceramic material. Each platter is coated with a magnetic material.
• Platters are magnetized on both sides.
• The platters are separated by spacers and attached to a spindle. The spindle is attached to a motor, and when the motor spins, the spindle and platters spin in unison.


5 Hardware: The Basics

• Read/write heads, as the name implies, read and write data from the disk platters.
• There is generally one head per platter side.
• The heads fly just above the platter surface; some fly heights are as little as 3 nanometres (3 billionths of a meter).
• Read and write heads have evolved over time from the magnetic coil to GMR (Giant Magnetoresistive) heads, which are very sensitive.



7 Hardware: The Basics

• The read/write heads are mounted to an actuator assembly, or head assembly.
• The heads are attached to a device called a slider.
• The slider's function is to support the head and hold it in the correct location over the platter.


8 Inside the Disk Drive


9 Hardware: How data is stored

• Platters are broken down into tracks (concentric rings) and sectors, or blocks. A block or sector is generally the smallest addressable unit.
• Data is stored or read magnetically via the read/write heads on the surface of the platters.


10 The Platter


11 Hardware: Terms

• Seek time measures the time it takes to move the read/write heads to a track on the platters.
• Rotational delay (latency) is the time it takes for the desired sector or block within a track to rotate under the read/write heads.
• Transfer time is the time taken to actually move the data to and from the disk.
• Areal density, or bit density, is the amount of data that can be packed onto a storage medium, measured in gigabits per square inch.
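As a rough worked example of how these terms combine, assume a hypothetical 15,000 RPM drive with an average seek time of 3.5 ms (the figures are illustrative only):

One revolution = 60 s / 15,000 revolutions ≈ 4.0 ms
Average rotational delay ≈ half a revolution ≈ 2.0 ms
Average access time ≈ seek time + rotational delay ≈ 3.5 ms + 2.0 ms = 5.5 ms, plus the transfer time for the data itself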


12 Hardware: Terms

• SATA - Serial Advanced Technology Attachment. Based on serial signaling technology; cables are thinner and more flexible.
• LUN - Logical Unit Number.
• Controller software - Software that manages devices.
• Mirroring - RAID 1. Drives are mirrored; every write is done twice.
• Disk striping - Spreading data across many disks.

http://www.snia.org/education/dictionary/ is a very good site for network and storage terms.

13 Hardware: Storage Arrays

• Simply stated, storage arrays are large boxes that hold many hard disks.
• IBM, HP, EMC, Sun, and Hitachi, to name a few, offer storage arrays in different sizes and performance ranges.
• Storage arrays can scale to the multi-terabyte range.
• Storage arrays include memory cache and come in different sizes.


14 Storage Array


15 Hardware: RAID

• RAID stands for Redundant Array of Independent (originally Inexpensive) Disks.
• Provides a method to access multiple disks as if they were one large disk.
• Data can be spread across multiple disks, improving performance because more read/write heads are involved.
• Reliability can be increased with certain RAID types.
• There are multiple RAID types: 0, 1, 1+0, 2, 3, 4, 5, 6, etc.


• RAID-0: This technique has striping but no redundancy of data. It offers the best performance but no fault tolerance.
• RAID-1: This type is also known as disk mirroring and consists of at least two drives that duplicate the storage of data. There is no striping. Read performance is improved since either disk can be read at the same time. Write performance is the same as for single-disk storage. RAID-1 provides the best performance and the best fault tolerance in a multi-user system.
• RAID-2: This type uses striping across disks, with some disks storing error checking and correcting (ECC) information. It has no advantage over RAID-3.
• RAID-3: This type uses striping and dedicates one drive to storing parity information. The embedded error checking (ECC) information is used to detect errors. Data recovery is accomplished by calculating the exclusive OR (XOR) of the information recorded on the other drives. Since an I/O operation addresses all drives at the same time, RAID-3 cannot overlap I/O. For this reason, RAID-3 is best for single-user systems with long-record applications.
• RAID-4: This type uses large stripes, which means you can read records from any single drive. This allows you to take advantage of overlapped I/O for read operations. Since all write operations have to update the parity drive, no I/O overlapping is possible. RAID-4 offers no advantage over RAID-5.
• RAID-5: This type includes a rotating parity array, thus addressing the write limitation in RAID-4, so all read and write operations can be overlapped. RAID-5 stores parity information but not redundant data (but parity information can be used to reconstruct data). RAID-5 requires at least three and usually five disks for the array. It is best for multi-user systems in which performance is not critical or which do few write operations.
• RAID-6: This type is similar to RAID-5 but includes a second parity scheme that is distributed across different drives and thus offers extremely high fault- and drive-failure tolerance.
• RAID-7: This type includes a real-time embedded operating system as a controller, caching via a high-speed bus, and other characteristics of a stand-alone computer. One vendor offers this system.
• RAID-10: Combining RAID-0 and RAID-1 is often referred to as RAID-10, which offers higher performance than RAID-1 but at much higher cost. There are two subtypes: in RAID-0+1, data is organized as stripes across multiple disks, and then the striped disk sets are mirrored; in RAID-1+0, the data is mirrored and the mirrors are striped.
• RAID-50 (or RAID-5+0): This type consists of a series of RAID-5 groups striped in RAID-0 fashion to improve RAID-5 performance without reducing data protection.
• RAID-53 (or RAID-5+3): This type uses striping (in RAID-0 style) for RAID-3's virtual disk blocks. This offers higher performance than RAID-3 but at much higher cost.
• RAID-S (also known as Parity RAID): This is an alternate, proprietary method for striped parity RAID from EMC Symmetrix that is no longer in use on current equipment. It appears to be similar to RAID-5 with some performance enhancements, as well as the enhancements that come from having a high-speed disk cache on the disk array.

16 RAID 0


RAID 0 implements block striping, where data is broken into logical blocks and is striped across several drives. Unlike other RAID levels, there is no facility for redundancy. In the event of a disk failure, data is lost. In block striping, the total disk capacity is equivalent to the sum of the capacities of all drives in the array. This combination of drives appears to the system as a single logical drive. RAID 0 provides the highest performance. It is fast because data can be simultaneously transferred to or from every disk in the array. Furthermore, read/writes to separate drives can be processed concurrently.

17 RAID 1


RAID 1 implements disk mirroring, where a copy of the same data is recorded onto two drives. By keeping two copies of data on separate disks, data is protected against a disk failure. If, at any time, a disk in the RAID 1 array fails, the remaining good disk (copy) can provide all of the data needed, thus preventing downtime. In disk mirroring, the total usable capacity is equivalent to the capacity of one drive in the RAID 1 array. Thus, combining two 1-Gbyte drives, for example, creates a single logical drive with a total usable capacity of 1 Gbyte. This combination of drives appears to the system as a single logical drive.

18 RAID 0 + 1


RAID 1+0 combines RAID 0 and RAID 1 to offer mirroring and disk striping. Using RAID 1+0 is a time-saving feature that enables you to configure a large number of disks for mirroring in one step. It is not a standard RAID level option that you can select; it does not appear in the list of RAID level options supported by the controller. If four or more disk drives are chosen for a RAID 1 logical drive, RAID 1+0 is performed automatically.

19 RAID 5


RAID 5 implements multiple-block striping with distributed parity. This RAID level offers redundancy with the parity information distributed across all disks in the array. Data and its parity are never stored on the same disk. In the event that a disk fails, original data can be reconstructed using the parity information and the information on the remaining disks.
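As a rough worked illustration of the parity arithmetic, assume a hypothetical RAID 5 array built from five 146 GB drives (the figures are illustrative only):

Usable capacity = (5 - 1) x 146 GB = 584 GB; one drive's worth of space holds parity, spread across all five drives.
Parity for a stripe: P = D1 XOR D2 XOR D3 XOR D4
If the disk holding D2 fails: D2 = D1 XOR D3 XOR D4 XOR P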

20 Hardware: SAN

• A high-performance network whose primary purpose is to enable storage devices to communicate with computer systems and with each other.
• Can use Fibre Channel, Ethernet, or any other type of interconnect technology.
• Disks, tapes, RAID subsystems, and file servers can be connected to a SAN.
• SANs support disk mirroring, backup and restore, and sharing of data among different servers.


21 Hardware: SAN

• A SAN provides universal connectivity.
• Any-to-any connectivity between storage and servers.

http://www.snia.org - Storage Networking Industry Association


23 AIX Logical Volume Manager LVM

• Volume management creates a layer of abstraction over the storage.
• Applications use virtual storage, which is managed by volume management software, a Logical Volume Manager (LVM).
• The LVM hides the details of where data is stored, on which actual hardware and where on that hardware, from the rest of the system.
• Volume management lets you change the storage configuration without actually changing anything on the hardware side, and vice versa.
• By hiding the hardware details, it completely separates hardware and software storage management, so that it is possible to change the hardware side without the software ever noticing, all during runtime.


Advantages:
• File systems can span disks, so size is not limited by the capacity of physical disks.
• Can expand file systems on the fly.
• Can add additional disks to an existing pool of disk space (VG).
• Can mirror important data on multiple physical disks for redundancy.
• Can "export" an entire VG so that all disks in the VG can be easily physically disconnected, moved to another machine, and "imported".
Limitations:
• Must reduce the VG when removing a disk.
• When a single disk in a VG dies, the whole VG is affected.
• "Brick wall" between VGs: LVs can't cross a VG boundary.
• Cannot shrink file systems.

24 Logical Volume Manager


25 AIX Logical Volume Manager LVM

• Physical Volume (PV): Synonym for "hard disk"; a single physical hard drive.
• Volume Group (VG): A set of one or more PVs which form a single storage pool. You can define multiple VGs on each AIX system.
• Physical Partition (PP): The smallest allocation unit in the LVM. All PPs within a VG are the same size, usually 4 or 8 MB.


Logical Volume Manager: The set of operating system commands, library subroutines, and other tools that allow you to establish and control logical volume storage is called the Logical Volume Manager (LVM). The LVM controls disk resources by mapping data between a more simple and flexible logical view of storage space and the actual physical disks. The LVM does this using a layer of device-driver code that runs above traditional disk device drivers.

The LVM consists of the logical volume device driver (LVDD) and the LVM subroutine interface library. The LVDD is a pseudo-device driver that manages and processes all I/O; it translates logical addresses into physical addresses and sends I/O requests to specific device drivers. The LVM subroutine interface library contains routines that are used by the system management commands to perform system management tasks for the logical and physical volumes of a system. For more information about how the LVM works, see Understanding the Logical Volume Device Driver in AIX 5L Version 5.3 Kernel Extensions and Device Support Programming Concepts, and Logical Volume Programming Overview in AIX 5L Version 5.3 General Programming Concepts: Writing and Debugging Programs.

26 AIX Logical Volume Manager LVM

• Volume Group (VG): A set of one or more PVs which form a single storage pool. You can define multiple VGs on each AIX system.
• Logical Partition (LP): A logical mapping to one or more PPs within the same VG, regardless of whether they are on the same PV or not.
• Logical Volume (LV): A set of one or more LPs within the same VG which form a usable unit of disk space. LVs are used analogously to partitions on PCs or slices under Solaris: they usually contain file systems or paging spaces ("swap").


Logical volume storage concepts: The logical volume (which can span physical volumes) is composed of logical partitions allocated onto physical partitions.
[Figure 1: Volume Group. A volume group composed of three physical volumes with the maximum range specified; the logical volume spans the physical volumes and is composed of logical partitions allocated onto physical partitions.]

• Physical volumes: A disk must be designated as a physical volume and be put into an available state before it can be assigned to a volume group.
• Volume groups: A volume group is a collection of 1 to 32 physical volumes of varying sizes and types.
• Physical partitions: When you add a physical volume to a volume group, the physical volume is partitioned into contiguous, equal-sized units of space called physical partitions. A physical partition is the smallest unit of storage space allocation and is a contiguous space on a physical volume.
• Logical volumes: After you create a volume group, you can create logical volumes within that volume group.
• Logical partitions: When you create a logical volume, you specify the number of logical partitions for the logical volume.
• File systems: The logical volume defines allocation of disk space down to the physical-partition level. Finer levels of data management are accomplished by higher-level software components such as the Virtual Memory Manager or the file system. Therefore, the final step in the evolution of a disk is the creation of file systems.
• Raw logical volumes: A raw logical volume is an area of physical and logical disk space that is under the direct control of an application, such as a database or a partition, rather than under the direct control of the operating system or a file system.
• Mirroring: Volume groups, including the root volume group (rootvg), can be mirrored and unmirrored.
• Disk removal: On certain systems, a hot-removability feature lets you remove a disk without turning the system off, so the system remains available.
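A minimal AIX sketch of these concepts follows. The disk names hdisk2/hdisk3, the volume group name datavg, and the logical volume name db2lv are hypothetical; verify options against your AIX level:

# Create a volume group from two physical volumes
mkvg -y datavg hdisk2 hdisk3

# Create a JFS2-type logical volume of 100 logical partitions in that volume group
mklv -y db2lv -t jfs2 datavg 100

# Inspect the resulting layout
lsvg datavg
lsvg -l datavg
lslv db2lv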

27 AIX Logical Volume Manager LVM Journaled File System JFS JFS2
• A hierarchical structure of files and directories.
• Journaled file systems are created on top of the LVM.
• File systems also contain a superblock, i-nodes, data blocks, and allocation bitmaps.
• The superblock maintains information about the file system: the size of the file system, the number of blocks in the file system, and a flag indicating the state of the file system.


The journaled file system (JFS) and the enhanced journaled file system (JFS2) are built into the base operating system. Both file system types link their file and directory data to the structure used by the AIX® Logical Volume Manager for storage and retrieval. A difference is that JFS2 is designed to accommodate a 64-bit kernel and larger files. Unless otherwise noted, the following sections apply equally to JFS and JFS2.
http://publib.boulder.ibm.com/infocenter/pseries/v5r3/index.jsp?topic=/com.ibm.aix.baseadmn/doc/baseadmndita/fs_jfs_jfs2.htm

28 AIX Logical Volume Manager LVM Journaled File System JFS JFS2
• Logical blocks contain file or directory data.
• Disk i-nodes: each file or directory has an i-node.
• I-nodes contain information such as the owner's user ID, the number of links to the file, access permissions for users and groups, the size of the file, real disk addresses, the last time the file was modified, and the last time it was accessed.


29 Sample /etc/filesystems

/:
        dev             = /dev/hd4
        vol             = "root"
        mount           = automatic
        check           = false
        free            = true
        vfs             = jfs2
        log             = /dev/hd8
        type            = bootfs

/home:
        dev             = /dev/hd1
        vol             = "/home"
        mount           = true
        check           = true
        free            = false
        vfs             = jfs2
        log             = /dev/hd8


http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/IBMp690/IBM/usr/share/man/info/en_US/a_doc_lib/aixbman/admnconc/lvm_overview.htm#HDRBA7C2CF562MART

The main /etc/filesystems attributes:
• account: Used by the dodisk command to determine the file systems to be processed by the accounting system. This value can be either True or False.
• boot: Used by the mkfs command to initialize the boot block of a new file system. This specifies the name of the load module to be placed into the first block of the file system.
• check: Used by the fsck command to determine the default file systems to be checked. The True value enables checking while the False value disables checking. If a number, rather than the True value, is specified, the file system is checked in the specified pass of checking. Multiple-pass checking, described in the fsck command, permits file systems on different drives to be checked in parallel.
• dev: Identifies, for local mounts, either the block special file where the file system resides or the file or directory to be mounted. System management utilities use this attribute to map file system names to the corresponding device names. For remote mounts, it identifies the file or directory to be mounted.
• free: This value can be either true or false. Obsolete and ignored.
• mount: Used by the mount command to determine whether this file system should be mounted by default. The possible values are:
  - automatic: Automatically mounts a file system when the system is started. Unlike the true value, file systems mounted with the automatic value are not mounted with the mount all command or unmounted with the unmount all command. By default, the /, /usr, /var, and /tmp file systems use the automatic value.
  - false: This file system is not mounted by default.
  - true: This file system is mounted by the mount all command. It is unmounted by the unmount all command. The mount all command is issued during system initialization to mount automatically all such file systems.
• nodename: Used by the mount command to determine which node contains the remote file system. If this attribute is not present, the mount is a local mount. The value of the nodename attribute should be a valid node nickname. This value can be overridden with the mount -n command.
• options: Comma-separated list of keywords that have meaning specific to a file system type. The options are passed to the file system at mount time.
• size: Used by the mkfs command for reference and to build the file system. The value is the number of 512-byte blocks in the file system.
• type: Used to group related mounts. When the mount -t String command is issued, all of the currently unmounted file systems with a type attribute equal to the String parameter are mounted.

30 AIX Logical Volume Manager LVM Journaled File System JFS JFS2
• You can create one file system per logical volume. To create a file system, use the crfs command (a hedged example follows below).
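A hedged example of creating a file system on the logical volume from the earlier sketch (the mount point /db2/data01 is hypothetical; options vary by AIX level):

# Create a JFS2 file system on logical volume db2lv, mounted at /db2/data01,
# and record it in /etc/filesystems
crfs -v jfs2 -d db2lv -m /db2/data01 -A yes

# Mount it now, and grow it later on the fly if needed
mount /db2/data01
chfs -a size=+1G /db2/data01    # the +1G notation is accepted on recent AIX levels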


31 Database view of storage

• Databases and tablespaces are created on file systems.
• Containers can be SMS or DMS (see the sketch below).
• Many parameters control the use of storage: prefetching, concurrent I/O, automatic storage, data compression, extent size, DB2 striping.
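A hedged sketch of the two container types; the tablespace names, paths, and page counts are illustrative and the file systems are assumed to exist:

# SMS: the operating system's file system manages the space; it grows as needed
db2 "CREATE TABLESPACE ts_sms MANAGED BY SYSTEM USING ('/db2/sms01/ts_sms')"

# DMS: DB2 manages pre-allocated file containers (25600 pages of 4 KB = 100 MB here)
db2 "CREATE TABLESPACE ts_dms MANAGED BY DATABASE USING (FILE '/db2/dms01/ts_dms.dat' 25600)"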


32 Database view of storage: CIO

• A new file system feature called Concurrent I/O (CIO) was introduced in the Enhanced Journaled File System (JFS2) in AIX 5L™ Version 5.2.0.10, also known as maintenance level 01 (announced May 27, 2003).
• CIO provides performance advantages similar to raw logical volumes.
• Use of Direct I/O (DIO) is implicit with CIO.
• CIO does not do inode locking.

http://www3.software.ibm.com/ibmdl/pub/software/dw/dm/db2/dm-0408lee/CIO-article.pdf
http://www-03.ibm.com/servers/aix/whitepapers/db_perf_aix.pdf

33 Database view of storage: CIO


34 Database view of storage: CIO


35 Database view of storage: CIO


36 Database view of storage: CIO

• Buffered I/O is the default when creating a tablespace.
• To use CIO, the file system must be JFS2 and NO FILE SYSTEM CACHING must be specified for the tablespace, either at tablespace creation or with the ALTER TABLESPACE command (see the sketch below).
• If CIO is enabled but not supported at the file system level (for example, on JFS), then DIO will be used instead.
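A hedged example of both forms; the tablespace name and path are illustrative, and the underlying file system should be JFS2 for CIO to take effect:

# Create a tablespace that bypasses the file system cache (CIO/DIO)
db2 "CREATE TABLESPACE ts_app MANAGED BY DATABASE USING (FILE '/db2/dms01/ts_app.dat' 25600) NO FILE SYSTEM CACHING"

# Or switch an existing tablespace
db2 "ALTER TABLESPACE ts_app NO FILE SYSTEM CACHING"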


37 Database view of storage: Automatic Storage
• Tablespaces can be created without specifying the containers; DB2 takes care of creating and managing them.
• A database can only be enabled for automatic storage at creation time. You cannot enable it for a database that was not created with it enabled.
• You can't disable automatic storage for a database created with automatic storage.
• Available in Version 8.2.2 (8.1 FixPak 9).

ftp://ftp.software.ibm.com/software/data/db2/9/labchats/20061114-slides.pdf

38 Database view of storage: Automatic Storage
• AUTORESIZE YES allows DMS tablespaces to grow automatically.

ALTER TABLESPACE ts1
    AUTORESIZE YES
    INCREASESIZE integer PERCENT | integer K | M | G
    MAXSIZE integer K | M | G | NONE

CREATE TABLESPACE ts1
    MANAGED BY AUTOMATIC STORAGE
    INITIALSIZE x K | M | G
    INCREASESIZE x PERCENT | x K | M | G
    MAXSIZE x K | M | G | NONE


SMS tablespaces grow automatically. RAW containers must be increased at the OS level to grow. Only DMS tablespaces can be automatically resized or increased.
http://blogs.ittoolbox.com/database/technology/archives/new-automatic-storage-in-822-4720
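A hedged end-to-end sketch, roughly in the DB2 9 syntax; the database name, storage paths, and sizes are illustrative:

# Create a database enabled for automatic storage over two storage paths
db2 "CREATE DATABASE PRODDB AUTOMATIC STORAGE YES ON /db2/stor01, /db2/stor02"

# Tablespace with no explicit containers; DB2 places them on the storage paths and grows them
db2 "CREATE TABLESPACE ts_auto MANAGED BY AUTOMATIC STORAGE INITIALSIZE 100 M INCREASESIZE 10 PERCENT MAXSIZE 10 G"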

39 Database view of storage: Compression
• Version 9 of DB2 LUW provides row compression.
• V8 provides backup compression.
• V8 provides MDC, which reduces the size of indexes because of block-type indexes.
• Row compression looks at the entire contents of a table for repeating strings, stores them in a dictionary, and replaces each occurrence with a 12-bit symbol that represents the data stored in the dictionary.
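A hedged DB2 9 sketch; the table name is illustrative, and the compression dictionary is built by a classic REORG after compression is enabled:

# Estimate the potential savings first
db2 "INSPECT ROWCOMPESTIMATE TABLE NAME sales RESULTS KEEP sales.insp"

# Enable row compression and build the dictionary
db2 "ALTER TABLE sales COMPRESS YES"
db2 "REORG TABLE sales"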


40 Database view of storage: Prefetching

• Prefetching means reading data in from disk into the buffer pool before it is requested.
• List prefetch: fetching a set of non-sequential data pages by using an index and RIDs to fetch rows.
• Sequential prefetch: reading consecutive data pages into the buffer pool before they are needed.
• Block-based buffer pools (Version 8): contiguous pages on disk can be read into contiguous pages of the buffer pool.
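A hedged sketch; the names and sizes are illustrative, and PREFETCHSIZE is commonly set to the extent size multiplied by the number of containers:

# Raise the prefetch size for a tablespace (in pages)
db2 "ALTER TABLESPACE ts_app PREFETCHSIZE 64"

# Reserve a block area in a buffer pool for block-based (sequential) prefetching
db2 "ALTER BUFFERPOOL ibmdefaultbp NUMBLOCKPAGES 4096 BLOCKSIZE 32"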


41 Database view of storage: Extents

• Extent size is the number of pages written to a container before moving on to the next container.
• Extents are sliced across all containers: for DMS, an extent is written to the first container, then the next extent to the second container, and so forth.
• SMS can use multi-page file allocation to allocate space an extent at a time instead of a page at a time.
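A hedged example tying extents, containers, and prefetching together; the three containers and sizes are illustrative:

# Each extent is 8 pages; one prefetch of 24 pages touches all three containers
db2 "CREATE TABLESPACE ts_big MANAGED BY DATABASE
     USING (FILE '/db2/c1/ts_big.dat' 25600,
            FILE '/db2/c2/ts_big.dat' 25600,
            FILE '/db2/c3/ts_big.dat' 25600)
     EXTENTSIZE 8 PREFETCHSIZE 24"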


Tablespace maps are used to map DMS storage allocation. There is very good information on tablespace maps in the DB2 Information Center.

42 Database view of storage: Extents


43 Database view of storage: Extents


44 DPF and storage

• Shared-nothing environment.
• Parallelism is the key to performance.
• Place separate partitions on separate disk arrays.
• Each partition has its own logs, temp space, tablespaces, indexes, and buffer pools (see the container sketch below).
• Consider using the BCU (Balanced Configuration Unit) model for designing the database.
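A hedged sketch of keeping each partition's containers separate by using the database partition expression in the container name; the partition group pg_all and the paths are hypothetical, and ' $N' is replaced by the partition number on each partition:

# One container per partition; ' $N' expands to the database partition number
db2 "CREATE TABLESPACE ts_fact IN DATABASE PARTITION GROUP pg_all
     MANAGED BY DATABASE
     USING (FILE '/db2/data/cont $N' 128000)
     EXTENTSIZE 16"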


45 DPF and storage

• A good partition key is essential; an even distribution of data is desired.
• DB2ADVIS will recommend partition keys.
• Periodically check the distribution of data (a hedged query follows below).
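A hedged way to check for skew, assuming a hypothetical table sales distributed on cust_id; DBPARTITIONNUM returns the partition number that holds each row:

db2 "SELECT DBPARTITIONNUM(cust_id) AS partition_num, COUNT(*) AS row_count
     FROM sales
     GROUP BY DBPARTITIONNUM(cust_id)
     ORDER BY 1"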


46 Backup and Restore Considerations

• Back up to disk, or back up directly to tape.
• Consider using TSM or something similar, such as Veritas.
• db2adutl works directly with TSM and is tightly integrated with it.
• You can specify multiple output paths on a backup to increase performance.
• Consider setting parallelism to a higher number to back up more than one tablespace at a time (see the sketch below).
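A hedged example of a multi-path, parallel backup; the database name, paths, buffer counts, and parallelism are illustrative:

# Spread the image over two paths, use 8 buffers, and back up 4 tablespaces at a time
db2 "BACKUP DATABASE PRODDB TO /db2/backup1, /db2/backup2 WITH 8 BUFFERS BUFFER 4096 PARALLELISM 4 COMPRESS"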


47 Backup and Restore Considerations

• Consider using INCLUDE LOGS to keep the needed logs with the backup image.
• Archive logs must be managed; develop a script to delete old logs.
• Back up configuration parameters and use db2look to save the DDL (see the sketch below).
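A hedged sketch of the points above; the database name, paths, and file names are illustrative:

# Online backup that carries the logs needed to restore to a consistent point
db2 "BACKUP DATABASE PRODDB ONLINE TO /db2/backup1 INCLUDE LOGS"

# Save the DDL and configuration alongside the backup image
db2look -d PRODDB -e -l -x -o proddb_ddl.sql
db2 "GET DB CFG FOR PRODDB" > proddb_dbcfg.txt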


48 LINKS

• http://publib.boulder.ibm.com/infocenter/pseries/v5r3/index.jsp
• http://blogs.ittoolbox.com/database/technology/
• http://www.db2mag.com/


49 Session: A12 Space: The Storage Frontier for Distributed DB2 DBA’s

Jerry Spence National City Corporation [email protected]
